Multiple Sequence Composition Alignment
DESCRIPTION
Multiple Sequence Composition Alignment. Name: Yip Chi Kin. Date: 21-12-2006. Studied Papers: [B03] Composition Alignment; [S98] Divide-and-conquer Alignment; [M99] DIALIGN Algorithm; [SMS03] DCA + Segment-based. Main Aspects: Dynamic Programming
TRANSCRIPT
![Page 1: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/1.jpg)
Multiple Sequence
Composition Alignment
Name: Yip Chi Kin
Date: 21-12-2006
![Page 2: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/2.jpg)
Studied Papers
[SMS03] DCA + Segment-based
[B03] Composition Alignment
[M99] DIALIGN Algorithm
[S98] Divide-and-conquer Alignment
![Page 3: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/3.jpg)
Main Aspects
․Dynamic Programming
․Composition Alignment
․Meta-code MSA
․Simultaneous MSA
Pairwise Library (Global & Local) Consistency & Ungapped
Divide-and-conquer
Segment-based (Optimal scores)
![Page 4: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/4.jpg)
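Dynamic Programming
DP Matrix and Dot Matrix: C T G A against C T G A
$S_{i,j} = \max \{\, S_{i-1,j-1} + s(a_i, b_j),\; S_{i-1,j} - d,\; S_{i,j-1} - d \,\}$
Edit Graph: C T G against C T A
matches: edge from $S_{i-1,j-1}$ with weight $s(a_i, b_j)$
deletions, insertions: edges from $S_{i-1,j}$ and $S_{i,j-1}$, each with weight $-d$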
![Page 5: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/5.jpg)
Global Alignment
Needleman-Wunsch Algorithm
Scoring
$s(a_i, b_j) = 1$ if $a_i = b_j$
$s(a_i, b_j) = -1$ if $a_i \neq b_j$
$s(a_i, -) = s(-, b_j) = -2$
$S_{i,0} = -2i, \quad S_{0,j} = -2j$
DP Matrix
       -    C    T    T    C    T
  -    0   -2   -4   -6   -8  -10
  G   -2   -1   -3   -5   -7   -9
  C   -4   -1   -2   -4   -4   -6
  A   -6   -3   -2   -3   -5   -5
  T   -8   -5   -2   -1   -3   -4
  C  -10   -7   -4   -3    0   -2
GA Results
G C A T C -
- C T T C T
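The global alignment on this slide can be reproduced with a short sketch. It assumes the scoring read from the slide (match +1, mismatch -1, gap penalty 2); the function name `needleman_wunsch` and its parameters are illustrative choices, not part of the original presentation.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=2):
    """Fill the global-alignment DP matrix and trace back one optimal alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    S = [[0] * cols for _ in range(rows)]
    # Boundary conditions: S[i][0] = -gap*i, S[0][j] = -gap*j
    for i in range(1, rows):
        S[i][0] = -gap * i
    for j in range(1, cols):
        S[0][j] = -gap * j
    # Recurrence: best of match/mismatch, deletion, insertion
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s,
                          S[i - 1][j] - gap,
                          S[i][j - 1] - gap)
    # Traceback from the bottom-right corner
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] - gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return S, "".join(reversed(top)), "".join(reversed(bottom))


S, top, bottom = needleman_wunsch("GCATC", "CTTCT")
print(S[-1][-1])  # optimal global score
print(top)
print(bottom)
```

With the slide's sequences GCATC and CTTCT this sketch ends with a score of -2 and the alignment `GCATC-` over `-CTTCT`. Other optimal alignments with the same score may exist; the traceback here simply prefers diagonal moves.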
![Page 6: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/6.jpg)
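Local Alignment
Smith-Waterman Algorithm
Scoring
$s(a_i, b_j) = 2$ if $a_i = b_j$
$s(a_i, b_j) = -1$ if $a_i \neq b_j$
$s(a_i, -) = s(-, b_j) = -2$
$S_{i,0} = 0, \quad S_{0,j} = 0$
DP Matrix
      -  T  T  T  A  C  A  G  G  C  A  G
  -   0  0  0  0  0  0  0  0  0  0  0  0
  G   0  0  0  0  0  0  0  2  2  0  0  2
  A   0  0  0  0  2  0  2  0  1  1  2  0
  A   0  0  0  0  2  1  2  1  0  0  3  1
  C   0  0  0  0  0  4  2  1  0  2  1  2
  G   0  0  0  0  0  2  3  4  3  2  1  3
  G   0  0  0  0  0  0  1  5  6  4  2  3
  T   0  2  2  2  0  0  0  3  4  5  3  1
GA Results
- G A A C - G G T - -
T T T A C A G G C A G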
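A minimal sketch of the local alignment follows. It assumes the scoring read from this slide (match +2, mismatch -1, gap penalty 2) and the usual Smith-Waterman floor of 0 in the recurrence, which the transcript does not show explicitly. The name `smith_waterman` is illustrative.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=2):
    """Return the best local score and one best-scoring local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    S = [[0] * cols for _ in range(rows)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0,
                          S[i - 1][j - 1] + s,
                          S[i - 1][j] - gap,
                          S[i][j - 1] - gap)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)
    # Trace back from the highest cell until a 0 is reached
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and S[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] - gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


best, top, bottom = smith_waterman("GAACGGT", "TTTACAGGCAG")
print(best, top, bottom)
```

For GAACGGT against TTTACAGGCAG the sketch finds a best local score of 6 with `AC-GG` aligned to `ACAGG`, which is the core of the alignment shown on the slide.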
![Page 7: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/7.jpg)
MSA Methods
․Consistency-based
․Exact method
․Progressive method
․Iterative method
․Stochastic method
․Hidden Markov method
![Page 8: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/8.jpg)
MSA Concepts
PSAs
C - G T C T
C T G T C C

C - - T G T C C
C G A T A T - T

C G - - - T C T
C G A T A T - T
[Figure: the three sequences drawn as letter chains in three panels: Trace formulation, Latter formulation, Consistency-based method]
![Page 9: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/9.jpg)
MSA Results
[Figure: Aligned regions marked on the letter chains of the sequences]
Results of MSA
G T CCT--C
G T TC---C
A T T-TAGC
G T TC---C
Unrealized Consistent
Realized
![Page 10: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/10.jpg)
Divide-and-conquer
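[Diagram: each of the sequences S1, S2, S3 is cut at a position C1, C2, C3 into a Prefix and a Suffix (Divide); the prefix group and the suffix group are divided again (Divide, Divide); the resulting short groups are aligned (Align optimally) and the partial alignments are joined (Concatenate)]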
![Page 11: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/11.jpg)
DP Distance
Sequence: GTTCATGCCAGGTGTAAATC (divided into Prefix and Suffix)
Wopt (prefix)
       -    C    T    A    T    A    C
  -    0    2    4    6    8   10   12
  G    2    1    3    5    7    9   11
  T    4    3    1    3    5    7    9
  A    6    5    3    1    3    5    7
  T    8    7    5    3    1    3    5
  C   10    8    7    5    3    2    3
Wopt (suffix)
       C    T    A    T    A    C    -
  G    3    4    3    4    6    8   10
  T    4    2    3    2    4    6    8
  A    6    4    2    2    2    4    6
  T    8    6    4    2    1    2    4
  C   10    8    6    4    2    0    2
  -   12   10    8    6    4    2    0
$C_{S_1,S_2}[C_1, C_2] = W_{opt}(\text{prefix}) + W_{opt}(\text{suffix}) - W_{opt}(\text{total})$
![Page 12: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/12.jpg)
Additional-cost
$C_{S_1,S_2}[2,2] = W_{opt}[CT, GT] + W_{opt}[ATAC, ATC] - W_{opt}[CTATAC, GTATC] = 1 + 2 - 3 = 0$
$C_{S_1,S_2}[4,3] = W_{opt}[CTAT, GTA] + W_{opt}[AC, TC] - W_{opt}[CTATAC, GTATC] = 3 + 1 - 3 = 1$
Additional-cost matrix
       -    C    T    A    T    A    C
  -    0    3    4    7   11   15   19
  G    3    0    3    4    8   12   16
  T    7    4    0    2    4    8   12
  A   11    8    4    0    1    4    8
  T   15   12    8    4    0    0    4
  C   19   15   12    8    4    1    0
Cost of Diagonal
$C_{S_1,S_2}[1,1] = 0$
$C_{S_1,S_2}[2,2] = 0$
$C_{S_1,S_2}[3,3] = 0$
$C_{S_1,S_2}[4,4] = 0$
$C_{S_1,S_2}[5,4] = 0$
$C_{S_1,S_2}[6,5] = 0$
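The additional-cost values can be recomputed with the sketch below. It assumes the distance scoring implied by the DP Distance matrices (match 0, mismatch 1, gap 2) and the two sequences CTATAC and GTATC; the helper names `distance_matrix` and `additional_cost` are hypothetical.

```python
def distance_matrix(a, b, mismatch=1, gap=2):
    """D[i][j] = optimal alignment distance between a[:i] and b[:j]."""
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        D[i][0] = gap * i
    for j in range(1, len(b) + 1):
        D[0][j] = gap * j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + sub,
                          D[i - 1][j] + gap,
                          D[i][j - 1] + gap)
    return D


def additional_cost(s1, s2):
    """C[i][j] = Wopt(prefixes) + Wopt(suffixes) - Wopt(total) for cuts (i, j)."""
    n, m = len(s1), len(s2)
    prefix = distance_matrix(s1, s2)
    # Suffix distances are prefix distances of the reversed sequences
    rev = distance_matrix(s1[::-1], s2[::-1])
    total = prefix[n][m]
    return [[prefix[i][j] + rev[n - i][m - j] - total
             for j in range(m + 1)]
            for i in range(n + 1)]


C = additional_cost("CTATAC", "GTATC")
print(C[2][2], C[4][3])  # cut positions [2,2] and [4,3]
```

A cut pair with additional cost 0 lies on an optimal alignment path, so the zero entries along the diagonal are the candidate cut positions for the divide step.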
![Page 13: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/13.jpg)
Space & Time
'Chain' of boxes along the diagonal, in order to reduce searching time
Full sequence searching
![Page 14: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/14.jpg)
DIALIGN
GA Results
y I A - V L F - A E d
- L A c V I F - G s -
p w d d V T F d A E -
[Figure: the sequences Y I A V L F A E D, L A C V I F G S and P W D D V T F D A E shown in three panels: Consistent diagonals; Non-Consistent (Simultaneous); Non-Consistent (Cross over)]
![Page 15: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/15.jpg)
Weighting
Diagonal Weights
$w(D) = -\log P(l_D, S_D)$
where $S_D$ is the sum of similarity values along diagonal $D$
and $l_D$ is the length of diagonal $D$
Overlap weighting
Y I A V L F A Y D D
L A C V I F G S
S W D D V M F Y A E
[Figure: the three sequences above shown in three panels with diagonals D1 to D5 marked]
w(D1) = 1.9
w(D2) = 1.7
w(D3) = 1.5
w(D4) = 2.6
w(D5) = 0.2
Diagonals D1, D4 and D5
Score = 1.9 + 2.6 + 0.2 = 4.7
Diagonals D1, D2, D3 and D5
Score = 1.9 + 1.7 + 1.5 + 0.2 = 5.3
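The two scores on this slide can be read as the outcome of two selection strategies over the same weighted diagonals. The sketch below is one possible reading: it assumes that D4 is inconsistent with D2 and with D3 (an assumption inferred from the two diagonal sets listed, not stated in the transcript), and compares a greedy pick by weight with an exhaustive search. All function names are illustrative.

```python
from itertools import combinations

# Diagonal weights as given on the slide
weights = {"D1": 1.9, "D2": 1.7, "D3": 1.5, "D4": 2.6, "D5": 0.2}

# ASSUMPTION: pairs of diagonals that cannot appear in the same alignment
conflicts = {frozenset(("D4", "D2")), frozenset(("D4", "D3"))}


def consistent(selection):
    """True if no pair of selected diagonals is in conflict."""
    return all(frozenset(pair) not in conflicts
               for pair in combinations(selection, 2))


def score(selection):
    """Sum of diagonal weights, rounded to one decimal as on the slide."""
    return round(sum(weights[d] for d in selection), 1)


def greedy_selection():
    """Take diagonals in order of decreasing weight if they stay consistent."""
    chosen = []
    for d in sorted(weights, key=weights.get, reverse=True):
        if consistent(chosen + [d]):
            chosen.append(d)
    return chosen


def best_selection():
    """Exhaustive search over all consistent subsets (fine for five diagonals)."""
    names = sorted(weights)
    best = []
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            if consistent(subset) and score(subset) > score(best):
                best = list(subset)
    return best


print(greedy_selection(), score(greedy_selection()))
print(best_selection(), score(best_selection()))
```

Under the assumed conflicts, the greedy pick yields D4, D1, D5 with score 4.7, while the exhaustive search yields D1, D2, D3, D5 with score 5.3.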
![Page 16: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/16.jpg)
Consistency check
[Figure, two panels (Overlap weights; Fragments checking): sequences S1 and S2 with positions 1 to 18 and S3 with positions 1 to 16, joined by fragments f1, f2 and f3]
Transitivity frontier [1,9]
![Page 17: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/17.jpg)
Greedy Strategy
Greedy Approach
[Figure, two panels (Tandem duplications; Consistency conflicts): motif occurrences M1 (1), M1 (2), M2, M2 (1), M2 (2) and M3 placed on sequences S1, S2 and S3]
![Page 18: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/18.jpg)
Composition Alignment
Single character match
Composition matches
A A C G T C T T T G A G C T C
A G C C T G A C T - G C C T A
Sequence #1: 0 1 0 1 1 1 0 1 0 0 0 1 0 0 0
Sequence #2: 0 1 0 0 0 1 1 0 1 1 1 0 1 1 1
+ + - + - - - + - - -
CM of Prefix Length: 0 0 0 1 2 2 1 2 1 0 -1 0 -1 -2 -3
Matching Prefix length
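The "CM of Prefix Length" row is the running sum of the position-wise differences between the two binary sequences. The sketch below recomputes it and lists windows whose composition matches; it assumes that a composition match means both windows contain the same number of 1s, which for a binary alphabet is equivalent to equal running sums at the window ends. The function names are illustrative.

```python
from itertools import accumulate

seq1 = [0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
seq2 = [0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1]


def prefix_difference(x, y):
    """Running sum of x[k] - y[k]: the CM value for each prefix length."""
    return list(accumulate(a - b for a, b in zip(x, y)))


def composition_matches(x, y):
    """Windows (start, end), end exclusive, where x and y hold equally many 1s."""
    cm = [0] + prefix_difference(x, y)  # cm[k] belongs to prefix length k
    return [(i, j)
            for i in range(len(cm))
            for j in range(i + 1, len(cm))
            if cm[i] == cm[j]]


print(prefix_difference(seq1, seq2))
print((3, 10) in composition_matches(seq1, seq2))
```

For example, the window from position 3 up to (not including) position 10 is a composition match: both sequences contain four 1s there, although the characters differ position by position.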
![Page 19: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/19.jpg)
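Match Length
[Plot: Composition Matching (vertical axis, values from -3 to 2) against Prefix length (horizontal axis, marks at 4, 9 and 15); match lengths marked on the plot: 3, 2, 2, 2, 2, 7]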
Replaced by 7
Replaced by 2
111010001001101110
1110100 010011011 10
Replaced by
![Page 20: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/20.jpg)
Composition Matching
CM of Prefix Length (Total=9)
Sequence #1: 0 1 0 1 1 1 0 1 0 0 0 1 0 0 0
Sequence #2: 0 1 0 0 0 1 1 0 1 1 1 0 1 1 1
[Figure: the two sequences repeated in three panels, with the matching regions marked for each level]
CM = 2
CM = 1
CM = 0
CM = -1
![Page 21: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/21.jpg)
Meta-Code
Code about code
[Diagram labels: Original Code, Meta-Code, Match code, Mismatch code, Code for Testing, Input code, Control Rule, Code Reservoir]
![Page 22: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/22.jpg)
Reservoir Codes
Code ‘CT’
Store code in Reservoir S2
Code from S1
Code from S2
Store code in Reservoir S1
If a code is found in both Reservoir S1 and Reservoir S2, delete these two codes
Reservoir Code (e.g. AGRCT)
Code ‘G’
Code ‘G’
Code ‘C’
Code ‘C’
Code ‘AG’
Code ‘A’ in S1
Code ‘T’ in S2
![Page 23: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/23.jpg)
Meta-Code Rule
Meta-code (e.g. AMT)
Codes from S1 and S2
Copy the codes from S1 and S2, p = p - 1, output meta-code.
If CM length is valid, reservoir code = r, Position = p.
Values of r and p
Value of r
If reservoir code = r, then stop the looping
Looping for creating meta-code
![Page 24: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/24.jpg)
CM (Lengths & Codes)
Composition Matching of S1 and S2 in prefix length
S1:       T A C T C G G A C
S2:       T T C G C C A T C
Length:   0 1 1 1 1 2 1 2 2
MetaCode: R ART ART ART ART AGRCT GRC GARTC GARTC
Reservoir codes in S1: A A A A AG A G GA GA AG AG
Reservoir codes in S2: T T T T TC C C CT CT TT CC
Further entries (codes, length, meta-code):
GT, 2, AGRTT
TC, 2, AGRCC
CG, 1, ARC
![Page 25: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/25.jpg)
CM of Metacode
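[Plot: Composition Matching (vertical axis, values from -1 to 2) against Prefix length (horizontal axis, marks at 6, 10 and 12); meta-codes shown along the curve: GARTC, ART, ART, AGRCT, ART, ARC, ART, AGRTT, GARTC; two segments are marked "Invalid length"]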
![Page 26: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/26.jpg)
Composition MSA
Composition matching
S2: T T C T G C C C G A T C
S1: T A C G T C G T C G A C
[Figure: four positions of S2 and S1 joined by | marks]
Meta-code MSA
S1:        T A C G T C G T C G A C
S2:        T T C T G C C C G A T C
Meta-code: T T C TMG GMT C C CMT GMC AMG TMA C
New S2:    T T C G T C C T C G A C
![Page 27: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/27.jpg)
Fixed Segment
Code catalogue
A = Currency / Cards
B = Stock / Structured P.
C = Unit Trusts / Bonds
D = Insurance / Finance
E = Mortgages / Loans
Time Granularities
Week #1, Week #2, …
Time:           1t1 1t2 1t3 1t4 1t5 2t1 2t2 2t3 2t4 2t5 3t1 3t2 …
Branch bank #1: A   C   B   C   E   B   A   A   E   B   C   A   …
Branch bank #2: B   E   B   A   A   A   C   E   D   B   E   A   …
Branch bank #3: A   C   A   A   B   C   E   E   D   B   E   E   …
․Weekly behaviour
Segment Length LS = 5
․Semi-global alignment
․Least overlap problem
․Simple segmentation
․Composition alignment
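The simple segmentation with segment length LS = 5 can be sketched as follows. The sketch cuts each branch sequence into weekly segments and compares segments by composition, meaning the multiset of codes regardless of order. The helper names `fixed_segments` and `same_composition` are hypothetical, and the sequences are the first twelve codes shown for Branch bank #1 and #2.

```python
from collections import Counter


def fixed_segments(seq, ls=5):
    """Cut a code sequence into consecutive segments of length ls."""
    return [seq[i:i + ls] for i in range(0, len(seq), ls)]


def same_composition(x, y):
    """True if both segments contain the same codes with the same counts."""
    return Counter(x) == Counter(y)


branch1 = "ACBCEBAAEBCA"
branch2 = "BEBAAACEDBEA"

weeks1 = fixed_segments(branch1)
weeks2 = fixed_segments(branch2)
print(weeks1)
print(weeks2)
# Week 2 of branch 1 against week 1 of branch 2
print(same_composition(weeks1[1], weeks2[0]))
```

Here week 2 of Branch bank #1 (`BAAEB`) and week 1 of Branch bank #2 (`BEBAA`) have the same composition although their order differs. The last segment is shorter than LS because only twelve codes are shown in the transcript.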
![Page 28: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/28.jpg)
Family Classifications
Fixed-Segment Composition MSA (Composition alignment)
Branch bank #1: C C B C A D
Branch bank #2: A A B A D C
Branch bank #3: A B A A B B
Family
Group
Meta-Code (PSA)
Meta-Code Branch bank #1: C C B A D C
Branch bank #2:           A A B A D C
Meta-Code Branch bank #3: A A B A B B
Family
Group
![Page 29: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/29.jpg)
Further Problems
․Fixed-segment length
․Prior sequence choice
․Speed-up PSAs
․Nos. of Segments/Codes
Meta-Code Composition MSA
![Page 30: Multiple Sequence Composition Alignment](https://reader035.vdocuments.us/reader035/viewer/2022081419/56814d71550346895dbac6f7/html5/thumbnails/30.jpg)
Conclusions
․Fixed-segment Composition (Least Overlap Problems)
․Meta-code Approach (Easier Transform Applications)
․Widespread use of MSA (Simultaneous Multiple Sequences)