bio medical tics - sequence analysis- alignment- 2011

Post on 07-Apr-2018

214 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    1/96

    1

    D N A S E Q U E N C I N G

    & S E Q U E N C E A L I G N M E N T

    Sequence Analysis

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    2/96

    2

    Sequence Alignment

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    3/96

    3

    Study Goal

    What is a sequence alignment?

    The difference between a global and local alignment and what theuses of each are.

    How to use the dot matrix methods to analyze genes andchromosomes

    The steps performed by the Needleman-Wunsh and Smith-Waterman Algorithms to produce a sequence alignment

    How to use scoring matrix values and gap penalties to produce asequence alignment

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    4/96

    4

    Sequence Alignment

    Definition: the comparison of two or more sequences bysearching for a series of individual characters or characterpatterns that are in the same order in the sequences

    Sequence alignment score:

    Sum of the individual log odds scores for each pair of alignedsequence characters in an alignment less a penalty for each gap ofone more position

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    5/96

    5

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    6/96

    6

    Pair-wise Sequence Alignment

    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

    DefinitionGiven two strings x = x1x2...xM, y = y1y2yN,

    an alignment is an assignment of gaps to positions0,, N in x, and 0,, N in y, so as to line up eachletter in one sequence with either a letter, or a gapin the other sequence

    AGGCTATCACCTGACCTCCAGGCCGATGCCC

    TAGCTATCACGACCGCGGTCGATTTGCCCGAC

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    7/96

    7

    What is a good alignment?

    AGGCTAGTT, AGCGAAGTTT

    AGGCTAGTT- 6 matches, 3 m ismatches, 1 gap

    AGCGAAGTTT

    AGGCTA-GTT- 7 m atches, 1 m isma tch, 3 gaps

    AG-CGAAGTTT

    AGGC-TA-GTT- 7 m atches, 0 m isma tches, 5 gapsAG-CG-AAGTTT

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    8/96

    8

    Evolution at the DNA level

    SEQUENCE EDITSACGGTGCAGTTACCA

    AC----CAGTCCACCA

    MutationDeletion

    REARRANGEMENTS

    InversionTranslocationDuplication

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    9/96

    9

    Evolutionary

    Still OK?

    OK

    OK

    OK

    X

    X

    next generation

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    10/96

    10

    Scor ing Function

    Sequence edits: AGGCCTC

    Mutations AGGACTC

    Insertions AGGGCCTC Deletions AGG . CTC

    Scor ing Function: Match: +m

    Mismatch: -s

    Gap: -d

    Score F = (# matches) m - (# mismatches) s(#gaps) d

    Alternative definition:

    m in im a l ed i t d i st a n ce

    Given two strings x, y,find minimum # of edits

    (insertions, deletions, mutations)

    to transform one string to theother

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    11/96

    11

    Pair-wise Alignment

    The alignment of two sequences (DNA or protein) is

    a relatively straightforward computational problem. There are lots of possible alignments.

    Two sequences can always be aligned.

    Sequence alignments have to be scored. Often there is m ore than one solution with the

    same score.

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    12/96

    12

    How do we compute the best alignment?

    AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA

    AGTGACCTGGGAAGACC

    CTGACCCTGGGTCACAAAACTC

    Too many possible

    alignments:

    >> 2N

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    13/96

    13

    Local alignment vs. Global Alignment

    Use to align the entire sequence

    Best for same length sequence

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    14/96

    14

    Local alignment vs. Global Alignment

    Use to align the similar sequence along certain length

    Best for sequence sharing a conserved region or domain

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    15/96

    15

    What do we want alignment for?

    Xenologous transferred genes

    Orthologous

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    16/96

    16

    Matches to similar sequences

    Sequence conservation Structure conservation

    function conservation What is conserved in a gene [protein] family is functionally important

    Due to purifying selection driven by functional constraints observable in abckground described by the theory of neutral evolution

    Fast enough that pseudogenes rapidly deteriorate over evolutionarytimescale

    In any prokaryotic genome, homologs from more than one distantly relatedspecies are detectable for 70 80 % of proteins

    Application: Comparison of sequence/structures can identifyhomologous relationships, allowing inference of function based on thatrelationship

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    17/96

    17

    Methods of Alignment

    By hand - slide sequences on two lines of a word

    processor Dot plot

    with windows

    Rigorous mathematical approach Dynamic programming (slow, optimal)

    Heuristic methods (fast, approximate)

    BLAST and FASTAWord matching and hash tables0

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    18/96

    18

    Align by Hand

    GATCGCCTA_TTACGTCCTGGAC AGGCATACGTA_GCCCTTTCGC

    You still need some kind of scoring system to find thebest alignment

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    19/96

    Dotplot19

    A

    T

    T

    C

    A C

    A

    T

    A

    T A C A T T A C G T A C

    Sequence 1

    Sequence 2

    A dotplot gives an overview of all possible alignments

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    20/96

    Dotplot20

    A

    T

    T

    C

    A

    C A

    T

    A

    T A C A T T A C G T A C

    T A C A T T A C G T A C

    A T A C A C T T A

    Sequence 1

    Sequence 2

    One possible alignment:

    In a dotplot each diagonal corresponds to a possible(ungapped) alignment

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    21/96

    21Insertions / Deletions in a DotplotT

    A

    CT

    G

    T

    C

    A

    T

    T A C T G T T C A TSequence 1

    Sequence 2

    T A C T G - T C A T| | | | | | | | |

    T A C T G T T C A T

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    22/96

    22

    Hemoglobin -chain

    Hemoglobin

    -chain

    Dotplot(Window = 130 / Stringency = 9)

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    23/96

    23Word Size Algorithm

    TA C G G T A T G

    A C AG T A T C

    T A C G G T A T G

    A C A G T A T C

    T A C G G T A T G

    A CA G T A T C

    T A C G G T AT G

    A C A G T AT C

    C

    T

    A

    T G

    A

    C

    A

    T A C G G T A T G

    Word Size = 3

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    24/96

    24

    PTHPLASKTQILPEDLASEDLTI

    PTHPLAGERAIGLARLAEEDFGM

    Score = 7

    PTHPLASKTQILPEDLASEDLTI

    PTHPLAGERAIGLARLAEEDFGM

    Score = 11

    Matrix: PAM250

    Window = 12

    Stringency = 9

    Scoring Matrix Filtering

    PTHPLASKTQILPEDLASEDLTI

    PTHPLAGERAIGLARLAEEDFGM

    Score = 11

    Window / Stringency

    Typical window size for DNA sequences is 15 bases, and stringency or matchrequirement in this window is 10, meaning if there are 10-15 matches within thewindow, and a dot is printed at the first base in the windows

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    25/96

    25Dotplot(Window = 18 / Stringency = 10)

    Hemoglobin

    -chain

    Hemoglobin -chain

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    26/96

    Considerations

    The window/stringency method is more sensitive than thewordsize method (ambiguities are permitted).

    The smaller the window, the larger the weight of statistical(unspecific) matches.

    With large windows the sensitivity for short sequences isreduced.

    Insertions/deletions are not treated explicitly.

    26

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    27/96

    27

    Dot Matrix

    Unless two sequences are known to be very much alike, thedot matrix method should be considered as a first choice for

    pair-wise sequence alignment Readily reveal the presence of insertions/deletions and

    direct and inverted repeats

    DNA sequence dot matrix comparison: long windows andhigh stringencies (7/11, 11/15)

    Protein sequences: short windows, stringencies (1/1) exceptfor short domain of partial similaritity in not similar

    sequences (15/5)

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    28/96

    Programs:

    DOTTER -

    http://www.cgb.ki.se/cgb/groups/sonnhammer/Dotter.html

    DOTPLOT - http://helix.nih.gov/docs/gcg/dotplot.html

    PLALIGN in FASTA EMBOSS http://emboss.sourceforge.net/download/

    28

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    29/96

    29

    Methods of Alignment

    By hand - slide sequences on two lines of a wordprocessor

    Dot plot

    with windows

    Rigor ous m athem atical appro ach Dynam ic progr am m ing (slow , optim al)

    Heuristic methods (fast, approximate)

    BLAST and FASTAWord matching and hash tables0

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    30/96

    30

    Alignment is additive

    Observation:

    The score of aligning x1xMy1yN

    is additive

    Say that x1xi xi+1xMaligns to y1yj yj+1yN

    The two scores add up:

    F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    31/96

    31

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    32/96

    Basic pr inciples of dynam ic program m ing32

    - Creation of an alignment path matrix

    - Stepwise calculation of score values

    - Backtracking (evaluation of the optimal path)

    It is applicable when a large search space can be structuredinto a succession of stages, such that the initial stage contains

    trivial solutions to sub-problems, each partial solution in a later

    stage can be calculated by recurring a fixed number of partial

    solutions in an earlier stage, the final stage contains the overallsolution - Mount

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    33/96

    33Creation of an alignment path matrix

    Idea:

    Build up an optimal alignment using previous solutions foroptimal alignments of smaller subsequences

    Construct matrix Findexed by iandj(one index for each sequence)

    F(i,j) is the score of the best alignment between the initial

    segment x1...iof xup to xi and the initial segment y1...j of yup to

    yj

    Build F(i,j) recursively beginning with F(0,0) = 0

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    34/96

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    35/96

    35

    If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can

    calculate F(i,j) Three possibilities:

    xi and yj are aligned, F(i,j) = F(i-1,j-1) + s(xi ,yj)

    xi is aligned to a gap, F(i,j) = F(i-1,j) - d yj is aligned to a gap, F(i,j) = F(i,j-1) - d

    The best score up to (i,j) will be the largest of the threeoptions

    Creation of an alignment path matrix

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    36/96

    36

    Dynam ic Program m ing

    There are only a polynomial number of subproblems

    Align x1

    xi

    to y1

    yj

    Original problem is one of the subproblems

    Align x1xM to y1yN Each subproblem is easily solved from smaller

    subproblems

    ???

    Then, we can applyDynamic Programming!!!

    Let F(i,j) = optimal score of aligning

    x1xiy1yj

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    37/96

    37

    Example of Dynamic programming algorithm

    Sequences a1, a2, a3, a4 and b1, b2, b3, b4

    Align sequences in a global alignment

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    38/96

    38

    Example of Dynamic programming algorithm

    Sequences a1, a2, a3, a4 and b1, b2, b3, b4

    Align sequences in a global alignment

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    39/96

    39

    Example of Dynamic programming algorithm

    Sequences a1, a2, a3, a4 and b1, b2, b3, b4

    Align sequences in a global alignment

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    40/96

    40

    Example of Dynamic programming algorithm

    Sequences a1, a2, a3, a4 and b1, b2, b3, b4

    Align sequences in a global alignment

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    41/96

    41

    Example of Dynamic programming algorithm

    Sequences a1, a2, a3, a4 and b1, b2, b3, b4

    Align sequences in a global alignment

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    42/96

    42

    Example of Dynamic programming algorithm

    Sequences a1, a2, a3, a4 and b1, b2, b3, b4

    Align sequences in a global alignment

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    43/96

    43

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    44/96

    Types of Scores44

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    45/96

    Scoring Matr ix / Substitution Matr ix45

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    46/96

    Scoring Matrix: Example

    A R N K

    A 5 -2 -1 -1

    R - 7 -1 3

    N - - 7 0

    K - - - 6

    Notice that although R andK are different amino acids,

    they have a positive score.

    Why? They are bothpositively charged amino

    acidswill not greatlychange function of protein.

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    47/96

    Conservation

    Amino acid changes that tend to preserve thephysicochemical properties of the original residue

    Polar to polar

    aspartate glutamate

    Nonpolar to nonpolaralanine valine

    Similarly behaving residues

    leucine to isoleucine

    l

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    48/96

    Scoring Examples48

    i i

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    49/96

    49

    Dynam ic Program m ing

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    50/96

    50

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    51/96

    51

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    52/96

    52

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    53/96

    53

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    54/96

    54

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    55/96

    55

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    56/96

    56

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    57/96

    57

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    58/96

    58

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    59/96

    Scoring systems

    59

    DNA Scor ing System s -very sim ple

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    60/96

    DNA Scor ing System s -very sim ple60

    actaccagttcatttgatacttctcaaataccattaccgtgttaactgaaaggacttaaagact

    Sequence 1Sequence 2

    A G C T

    A 1 0 0 0

    G 0 1 0 0

    C 0 0 1 0

    T 0 0 0 1

    Match: 1Mismatch: 0

    Score = 5

    Protein Scoring System s

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    61/96

    Protein Scoring System s61

    PTHPLASKTQILPEDLASEDLTI

    PTHPLAGERAIGLARLAEEDFGM

    Sequence 1

    Sequence 2

    Scoringmatrix

    T:G = -2T:T = 5Score = 48

    C S T P A G N D . .

    C 9

    S -1 4

    T -1 1 5

    P -3 -1 -1 7

    A 0 1 0 -1 4

    G -3 0 -2 -2 0 6

    N -3 1 0 -2 -2 0 5

    D -3 0 -1 -1 -2 -1 1 6

    .

    .

    C S T P A G N D . .

    C 9

    S -1 4

    T -1 1 5

    P -3 -1 -1 7

    A 0 1 0 -1 4

    G -3 0 -2 -2 0 6

    N -3 1 0 -2 -2 0 5

    D -3 0 -1 -1 -2 -1 1 6

    .

    .

    Protein Scoring System s

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    62/96

    Protein Scoring System s62

    Amino acids have different biochemical and physical propertiesthat influence their relative replaceability in evolution.

    CP

    GGA

    VI

    L

    MF

    Y

    W H

    K

    R

    EQ

    DNS

    TCSH

    S+S

    positive

    charged

    polar

    aliphatic

    aromatic

    small

    tiny

    hydrophobic

    Protein Scoring System s

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    63/96

    Protein Scoring System s

    Scoring m atr ices r eflect:

    # of mu tations to convert one to another

    chem ical similarity

    observed m utation frequen cies

    the pr obability of occur rence of each am ino acid

    W idely used scoring m atrices:

    PAM

    BLOSUM

    63

    PAM 250

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    64/96

    64

    A R N D C Q E G H I L K M F P S T W Y V B Z

    A 2 -2 0 0 -2 0 0 1 -1 -1 -2 -1 -1 -3 1 1 1 -6 -3 0 2 1

    R -2 6 0 -1 -4 1 -1 -3 2 -2 -3 3 0 -4 0 0 -1 2 -4 -2 1 2

    N 0 0 2 2 -4 1 1 0 2 -2 -3 1 -2 -3 0 1 0 -4 -2 -2 4 3

    D 0 -1 2 4 -5 2 3 1 1 -2 -4 0 -3 -6 -1 0 0 -7 -4 -2 5 4C -2 -4 -4 -5 12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3 0 -2 -8 0 -2 -3 -4

    Q 0 1 1 2 -5 4 2 -1 3 -2 -2 1 -1 -5 0 -1 -1 -5 -4 -2 3 5

    E 0 -1 1 3 -5 2 4 0 1 -2 -3 0 -2 -5 -1 0 0 -7 -4 -2 4 5

    G 1 -3 0 1 -3 -1 0 5 -2 -3 -4 -2 -3 -5 0 1 0 -7 -5 -1 2 1

    H -1 2 2 1 -3 3 1 -2 6 -2 -2 0 -2 -2 0 -1 -1 -3 0 -2 3 3

    I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 2 -2 2 1 -2 -1 0 -5 -1 4 -1 -1

    L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6 -3 4 2 -3 -3 -2 -2 -1 2 -2 -1K -1 3 1 0 -5 1 0 -2 0 -2 -3 5 0 -5 -1 0 0 -3 -4 -2 2 2

    M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 0 -2 -2 -1 -4 -2 2 -1 0

    F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 -5 -3 -3 0 7 -1 -3 -4

    P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6 1 0 -6 -5 -1 1 1

    S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 1 -2 -3 -1 2 1

    T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3 -5 -3 0 2 1

    W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17 0 -6 -4 -4

    Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10 -2 -2 -3

    V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4 0 0

    B 2 1 4 5 -3 3 4 2 3 -1 -2 2 -1 -3 1 2 2 -4 -2 0 6 5

    Z 1 2 3 4 -4 5 5 1 3 -1 -1 2 0 -4 1 1 1 -4 -3 0 5 6

    PAM 250

    C

    -8 17

    W

    W

    PAM

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    65/96

    65

    Dayhoff Matr ix

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    66/96

    y66

    Dayhoff Matr ix

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    67/96

    y67

    Dayhoff Matr ix

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    68/96

    y68

    BLOSUM Matrix

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    69/96

    69

    BLOSUM Matrix

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    70/96

    70

    AABCDA...BBCDA

    DABCDA.A.BBCBB

    BBBCDABA.BCCAA

    AAACDAC.DCBCDB

    CCBADAB.DBBDCC

    AAACAA...BBCCC

    Conserved blocks in alignments

    BLOSUM Matrix

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    71/96

    71

    AABCDA...BBCDA

    DABCDA.A.BBCBB

    BBBCDABA.BCCAA

    AAACDAC.DCBCDB

    CCBADAB.DBBDCC

    AAACAA...BBCCC

    Conserved blocks in alignments

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    72/96

    72

    TIPS on choosing a scoring matrix

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    73/96

    73

    S o c oos g a sco g at

    Generally, BLOSUM matrices perform better than PAM matricesfor local similarity searches (Henikoff & Henikoff, 1993).

    When comparing closely related proteins one should use lowerPAM or higher BLOSUM matrices, for distantly related proteinshigher PAM or lower BLOSUM matrices.

    For database searching the commonly used matrix is BLOSUM62.

    Significance of alignm ent

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    74/96

    74

    Significance of alignm ent

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    75/96

    75

    Database Sear ching

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    76/96

    76

    Database Sear ching

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    77/96

    77

    Local vs. Global Alignm ent

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    78/96

    Global Alignm ent

    Local Alignm entbetter alignm ent to find

    conserved segment

    --T-CC-C-AGT-TATGT-CAGGGGACACGA-GCATGCAGA-GAC

    | || | || | | | ||| || | | | | |||| |AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATGT-CAGAT--C

    tccCAGTTATGTCAGgggacacgagcatgcagagac

    ||||||||||||

    aattgccgccgtcgttttcagCAGTTATGTCAGatc

    Global vs. Local Alignm ent

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    79/96

    79

    Global Alignment Needleman and Wunsch (1970)

    Local Alignment Smith and Waterman (1981a)

    Global Alignment

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    80/96

    needle (Needleman & Wunsch) creates an end-to-end

    alignment.

    Two closely related sequences:

    Global Alignment

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    81/96

    Two sequences sharing several regions of local similarity:

    1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG.... 67

    1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAGCACTAAAGCGTCAGCGAGACCG 70

    |||||||||||||| | | | |||| || | | | ||

    The Needlem an-W unsch Matr ix

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    82/96

    82

    x1 xMy1

    yN

    Every non-decreasing

    path

    from (0,0) to (M, N)

    corresponds toan alignment

    of the two sequences

    An optimal alignment is composed ofoptimal subalignments

    The Needlem an-W unsch Algor ithm

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    83/96

    831. Initialization.

    F(0, 0) = 0 , F(0, j) = - j d, F(i, 0)= - i d

    2. Main Iteration. Filling-in partial alignmentsa. For each i = 1M

    For each j = 1N

    F(i-1,j-1) + s(xi, yj ) [case 1]

    F(i, j) = max F(i-1, j) d [case 2]

    F(i, j-1) d [case 3]

    DIAG, if [case 1]

    Ptr(i,j) = LEFT, if [case 2]

    UP, if [case 3]

    3. Termination. F(M, N) is the optimal score, and

    from Ptr(M, N) can trace back optimal alignment

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    84/96

    Local Alignment (Smith-Waterman)

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    85/96

    Local alignment

    Identify the most similar sub-region shared between two

    sequences Smith-Waterman

    The local alignm ent problem

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    86/96

    86

    Given two strings x = x1xM,

    y = y1yN

    Find substrings x, ywhose similarity

    (optimal global alignment value)

    is maximum

    x = aaaacccccggggtta

    y = ttcccgggaaccaacc

    87

    W hy local alignm ent examples

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    87/96

    87

    Genes areshuffled between

    genomes

    Portions ofproteins(domains) areoften conserved

    88

    Cross-species genom e sim ilar ity

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    88/96

    88

    98% of genes are conserved between any two mammals

    >70% average similarity in protein sequence

    hum_a : GTTGACAATAGAGGGTCTGGCAGAGGCTC--------------------- @ 57331/400001

    mus_a : GCTGACAATAGAGGGGCTGGCAGAGGCTC--------------------- @ 78560/400001

    rat_a : GCTGACAATAGAGGGGCTGGCAGAGACTC--------------------- @ 112658/369938

    fug_a : TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG @ 36008/68174

    hum_a : CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 57381/400001mus_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 78610/400001

    rat_a : CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 112708/369938

    fug_a : TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG @ 36058/68174

    hum_a : AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT @ 57431/400001

    mus_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT @ 78659/400001

    rat_a : AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT @ 112757/369938

    fug_a : AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC @ 36084/68174

    hum_a : AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 57481/400001

    mus_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 78708/400001

    rat_a : AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 112806/369938

    fug_a : CCGAGGACCCTGA------------------------------------- @ 36097/68174

    atoh enhancer inhuman, mouse, rat,fugu fish

    89

    The Sm ith-W aterm an algor ithm

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    89/96

    89

    Termination:

    1. If we want the best local alignment

    FOPT = maxi,j F(i, j)Find FOPT and trace back

    2. If we want all local alignments scoring > t

    For all i, j find F(i, j) > t, and trace backComplicated by overlapping local alignments

    (WatermanEggert 87: find all non-overlapping local alignments with minimal

    recalculation of the DP matrix)

    90

    Sm ith-w aterm an algor ithm

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    90/96

    90

    Basic Local Alignm ent Search Tool

    91

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    91/96

    91

    BLAST Algor ithm

    92

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    92/96

    92

    BLAST Algor ithm

    93

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    93/96

    93

    BLAST Algor ithm

    94

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    94/96

    94

    BLAST Algor ithm

    95

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    95/96

    95

    Basic BLAST Algorithms

    96

  • 8/4/2019 Bio Medical tics - Sequence Analysis- Alignment- 2011

    96/96