multiple sequence composition alignment
Post on 18-Jan-2016
Multiple Sequence Composition Alignment
Name: Yip Chi Kin
Date: 21-12-2006
Studied Papers
[SMS03] DCA + Segment-based
[B03] Composition Alignment
[M99] DIALIGN Algorithm
[S98] Divide-and-conquer Alignment
Main Aspects
․Dynamic Programming
․Composition Alignment
․Meta-code MSA
․Simultaneous MSA
Pairwise Library (Global & Local)
Consistency & Ungapped
Divide-and-conquer
Segment-based (Optimal scores)
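Dynamic Programming
[Figure: DP Matrix and Dot Matrix for the sequence C T G A against C T G A, with dots on the matching cells]
$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + s(a_i, b_j) \\ S_{i-1,j} - d \\ S_{i,j-1} - d \end{cases}$$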
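Edit Graph
[Figure: edit graph for C T G (columns) against C T A (rows). Edges are labelled matches, deletions and insertions. The match edge from $S_{i-1,j-1}$ has weight $s(a_i, b_j)$; the gap edges from $S_{i-1,j}$ and $S_{i,j-1}$ each have weight $-d$.]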
Global Alignment
Needleman-Wunsch Algorithm
        -    C    T    T    C    T
   -    0   -2   -4   -6   -8  -10
   G   -2   -1   -3   -5   -7   -9
   C   -4   -1   -2   -4   -4   -6
   A   -6   -3   -2   -3   -5   -5
   T   -8   -5   -2   -1   -3   -4
   C  -10   -7   -4   -3    0   -2
GA Results
G C A T C -
- C T T C T
Scoring
$s(a_i, -) = s(-, b_j) = -2$
if $a_i = b_j$ then $s(a_i, b_j) = 1$
if $a_i \neq b_j$ then $s(a_i, b_j) = -1$
$S_{i,0} = -2i, \quad S_{0,j} = -2j$
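The global-alignment recurrence and scoring can be sketched in Python as follows. This is only a minimal illustration, not the presenter's code: the function name `needleman_wunsch` and its parameter names are assumptions, and the traceback prefers a diagonal step when several moves are optimal.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=2):
    """Fill the global-alignment matrix S and trace back one optimal alignment."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # boundary: S[i][0] = -gap * i
        S[i][0] = -gap * i
    for j in range(1, m + 1):          # boundary: S[0][j] = -gap * j
        S[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s, S[i - 1][j] - gap, S[i][j - 1] - gap)

    # Traceback from the bottom-right cell to the origin.
    top, bottom, i, j = [], [], n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i, j = i - 1, j - 1
        elif i > 0 and S[i][j] == S[i - 1][j] - gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return S, "".join(reversed(top)), "".join(reversed(bottom))


S, row1, row2 = needleman_wunsch("GCATC", "CTTCT")
print(S[-1][-1])   # optimal global score
print(row1)
print(row2)
```

With the slide's sequences, the last row of `S` and the traced alignment can be compared against the matrix and the GA Results.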
Local Alignment
Smith-Waterman Algorithm
      -  T  T  T  A  C  A  G  G  C  A  G
   -  0  0  0  0  0  0  0  0  0  0  0  0
   G  0  0  0  0  0  0  0  2  2  0  0  2
   A  0  0  0  0  2  0  2  0  1  1  2  0
   A  0  0  0  0  2  1  2  1  0  0  3  1
   C  0  0  0  0  0  4  2  1  0  2  1  2
   G  0  0  0  0  0  2  3  4  3  1  1  3
   G  0  0  0  0  0  0  1  5  6  4  2  3
   T  0  2  2  2  0  0  0  3  4  5  3  1
GA Results
- G A A C - G G T - -
T T T A C A G G C A G
Scoring
$s(a_i, -) = s(-, b_j) = -2$
if $a_i = b_j$ then $s(a_i, b_j) = 2$
if $a_i \neq b_j$ then $s(a_i, b_j) = -1$
$S_{i,0} = S_{0,j} = 0$
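A minimal Python sketch of the local-alignment variant, assuming the scoring above (match 2, mismatch -1, gap 2, zero boundary). The function name `smith_waterman` and the tie-breaking rule (first best cell, diagonal step preferred) are assumptions made for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=2):
    """Return (S, best score, aligned part of a, aligned part of b)."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]   # S[i][0] = S[0][j] = 0
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment never lets a cell drop below zero.
            S[i][j] = max(0, S[i - 1][j - 1] + s, S[i - 1][j] - gap, S[i][j - 1] - gap)
            if S[i][j] > best:
                best, bi, bj = S[i][j], i, j

    # Trace back from the best cell until a zero cell is reached.
    top, bottom, i, j = [], [], bi, bj
    while i > 0 and j > 0 and S[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] - gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return S, best, "".join(reversed(top)), "".join(reversed(bottom))


S, best, part_a, part_b = smith_waterman("GAACGGT", "TTTACAGGCAG")
print(best)
print(part_a)
print(part_b)
```

The traced segment is the high-scoring core of the alignment; the unaligned flanks of both sequences are simply padded with gaps when the result is displayed.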
MSA Methods
․Consistency-based
․Exact method
․Progressive method
․Iterative method
․Stochastic method
․Hidden Markov method
MSA Concepts
PSAs
C - G T C T
C T G T C C

C - - T G T C C
C G A T A T - T

C G - - - T C T
C G A T A T - T
Trace formulation
[Figure: the example sequences drawn letter by letter with trace edges between aligned positions]
Latter formulation
[Figure: the example sequences drawn letter by letter]
Consistency-based method
MSA Results
Aligned regions
[Figure: the example sequences drawn letter by letter with their aligned regions marked]
Results of MSA
G T CCT--C
G T TC---C
A T T-TAGC
G T TC---C
Unrealized Consistent
Realized
Divide-and-conquer
[Figure: each sequence S1, S2, S3 is cut at its position C1, C2, C3 into a prefix and a suffix, and the parts are cut again in the same way. Steps shown: Divide, Align optimally, Concatenate.]
DP Distance
Wopt (prefix)
        -    C    T    A    T    A    C
   -    0    2    4    6    8   10   12
   G    2    1    3    5    7    9   11
   T    4    3    1    3    5    7    9
   A    6    5    3    1    3    5    7
   T    8    7    5    3    1    3    5
   C   10    8    7    5    3    2    3
Wopt (suffix)
        C    T    A    T    A    C    -
   G    3    4    3    4    6    8   10
   T    4    2    3    2    4    6    8
   A    6    4    2    2    2    4    6
   T    8    6    4    2    1    2    4
   C   10    8    6    4    2    0    2
   -   12   10    8    6    4    2    0
$C_{S_1,S_2}[C_1, C_2] = W_{opt}(\text{prefix}) + W_{opt}(\text{suffix}) - W_{opt}(\text{total})$
Sequence: GTTCATGCCAGGTGTAAATC
Additional-cost (Prefix and Suffix)
$C_{S_1,S_2}[2,2] = W_{opt}[CT, GT] + W_{opt}[ATAC, ATC] - W_{opt}[CTATAC, GTATC] = 1 + 2 - 3 = 0$
$C_{S_1,S_2}[4,3] = W_{opt}[CTAT, GTA] + W_{opt}[AC, TC] - W_{opt}[CTATAC, GTATC] = 3 + 1 - 3 = 1$
        -    C    T    A    T    A    C
   -    0    3    4    7   11   15   19
   G    3    0    3    4    8   12   16
   T    7    4    0    2    4    8   12
   A   11    8    4    0    1    4    8
   T   15   12    8    4    0    0    4
   C   19   15   12    8    4    1    0
Cost of Diagonal
$C_{S_1,S_2}[1,1] = 0$
$C_{S_1,S_2}[2,2] = 0$
$C_{S_1,S_2}[3,3] = 0$
$C_{S_1,S_2}[4,4] = 0$
$C_{S_1,S_2}[5,4] = 0$
$C_{S_1,S_2}[6,5] = 0$
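The additional-cost values can be recomputed with a short Python sketch. It assumes the distance scoring that the matrices imply (match 0, mismatch 1, gap 2) and obtains the suffix costs by running the same DP on the reversed sequences; the function names are hypothetical.

```python
def distance_matrix(a, b, mismatch=1, gap=2):
    """D[i][j] = optimal alignment distance between the prefixes a[:i] and b[:j]."""
    D = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        D[i][0] = gap * i
    for j in range(1, len(b) + 1):
        D[0][j] = gap * j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + sub, D[i - 1][j] + gap, D[i][j - 1] + gap)
    return D


def additional_cost(s1, s2):
    """C[c1][c2] = Wopt(prefix) + Wopt(suffix) - Wopt(total) for cuts c1 in s1, c2 in s2."""
    n, m = len(s1), len(s2)
    prefix = distance_matrix(s1, s2)
    # Suffix distances equal prefix distances of the reversed sequences.
    suffix = distance_matrix(s1[::-1], s2[::-1])
    total = prefix[n][m]
    return [[prefix[i][j] + suffix[n - i][m - j] - total for j in range(m + 1)]
            for i in range(n + 1)]


C = additional_cost("CTATAC", "GTATC")
print(C[2][2], C[4][3])
# Cut pairs with zero additional cost lie on an optimal alignment path.
print([(i, j) for i in range(7) for j in range(6) if C[i][j] == 0])
```

Here `C[c1][c2]` is indexed by the cut in S1 first, so one row of `C` corresponds to one column of the matrix above.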
Space & Time
‘Chain’ of boxes along the diagonal, in order to reduce searching time
Full sequence searching
DIALIGN
GA Results
y I A - V L F - A E d
- L A c V I F - G s -
p w d d V T F d A E -
[Figure: three panels over the sequences Y I A V L F A E D, L A C V I F G S and P W D D V T F D A E, titled Consistent diagonals, Non-Consistent (Simultaneous) and Non-Consistent (Cross over)]
Weighting
Diagonal Weights
$w(D) = -\log P(l_D, S_D)$
where $S_D$ is the sum of the similarity values along diagonal $D$
and $l_D$ is the length of diagonal $D$
Overlap weighting
[Figure: diagonals D1 to D5 drawn on the sequences Y I A V L F A Y D D, L A C V I F G S and S W D D V M F Y A E]
w(D1) = 1.9
w(D2) = 1.7
w(D3) = 1.5
w(D4) = 2.6
w(D5) = 0.2
Diagonals D1, D4 and D5
Score = 1.9 + 2.6 + 0.2 = 4.7
Diagonals D1, D2, D3 and D5
Score = 1.9 + 1.7 + 1.5 + 0.2 = 5.3
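The comparison of the two diagonal sets can be sketched in Python. The weights are the ones listed above; the conflict pairs (D4 against D2 and D3) are an assumption inferred from the two sets shown, since the transcript does not state which diagonals are mutually inconsistent. The function names are hypothetical.

```python
from itertools import combinations

weights = {"D1": 1.9, "D2": 1.7, "D3": 1.5, "D4": 2.6, "D5": 0.2}
# Assumed conflicts: D4 cannot be combined with D2 or D3.
conflicts = {frozenset(("D2", "D4")), frozenset(("D3", "D4"))}


def is_consistent(subset):
    """True if no two diagonals in the subset form an assumed conflict pair."""
    return not any(frozenset(pair) in conflicts for pair in combinations(subset, 2))


def greedy_set(weights):
    """Add diagonals in order of decreasing weight, skipping inconsistent ones."""
    chosen = []
    for name in sorted(weights, key=weights.get, reverse=True):
        if is_consistent(chosen + [name]):
            chosen.append(name)
    return sorted(chosen)


def best_set(weights):
    """Exhaustive search for the consistent subset with the largest weight sum."""
    names = sorted(weights)
    candidates = (subset for k in range(1, len(names) + 1)
                  for subset in combinations(names, k) if is_consistent(subset))
    return list(max(candidates, key=lambda subset: sum(weights[d] for d in subset)))


for chosen in (greedy_set(weights), best_set(weights)):
    print(chosen, round(sum(weights[d] for d in chosen), 1))
```

Under these assumed conflicts, picking the heaviest diagonal first locks in D4 and excludes D2 and D3, which is the kind of situation the weighting is meant to address.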
Consistency check
[Figure: fragments f1, f2 and f3 drawn between positions 1 to 18 of S1, 1 to 18 of S2 and 1 to 16 of S3; the position scales are shown twice]
Overlap weights
Fragments checking
Transitivity frontier [1,9]
Greedy Strategy
Greedy Approach
Tandem duplications
Consistency conflicts
[Figure: motifs M1 (1), M1 (2), M2 and M3 placed on sequences S1, S2 and S3 in several panels]
Composition Alignment
Single character match
Composition matches
A A C G T C T T T G A G C T C
A G C C T G A C T - G C C T A
Sequence #1: 0 1 0 1 1 1 0 1 0 0 0 1 0 0 0
Sequence #2: 0 1 0 0 0 1 1 0 1 1 1 0 1 1 1
Differences: + + - + - - - + - - -
CM of prefix length: 0 0 0 1 2 2 1 2 1 0 -1 0 -1 -2 -3
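The running values can be reproduced with a small Python sketch. It assumes that the CM value at a prefix length is the number of 1s seen so far in Sequence #1 minus the number seen in Sequence #2, so two substrings are a composition match whenever the value is the same before and after them. The function names are hypothetical.

```python
from itertools import accumulate

seq1 = [0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
seq2 = [0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1]


def prefix_difference(x, y):
    """Running sum of x[k] - y[k]: the CM value for each prefix length 1..len(x)."""
    return list(accumulate(a - b for a, b in zip(x, y)))


def composition_matches(x, y):
    """Half-open (start, end) pairs where x[start:end] and y[start:end] hold equally many 1s."""
    d = [0] + prefix_difference(x, y)          # d[p] is the CM value at prefix length p
    return [(p, q) for p in range(len(d)) for q in range(p + 1, len(d)) if d[p] == d[q]]


print(prefix_difference(seq1, seq2))
print((3, 10) in composition_matches(seq1, seq2))
print(sum(seq1[3:10]), sum(seq2[3:10]))
```

For binary sequences of equal length, an equal count of 1s also means an equal count of 0s, which is why a single running difference is enough here.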
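CM of Prefix Length
[Figure: Composition Matching (axis values 2, 1, 0, -1, -2, -3) plotted against Prefix length (axis marks 4, 9 and 15) for Sequence #1 and Sequence #2, with Matching Prefix length and Match Length marked; further labels 3, 2, 2, 2, 2, 7]
111010001001101110
1110100 010011011 10
[Figure labels: Replaced by 7, Replaced by 2]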
Composition Matching
CM of Prefix Length (Total=9)
Sequence #1: 0 1 0 1 1 1 0 1 0 0 0 1 0 0 0
Sequence #2: 0 1 0 0 0 1 1 0 1 1 1 0 1 1 1
[Figure: the two sequences shown three times with regions marked CM = 2, CM = 1, CM = 0 and CM = -1]
Meta-Code
Code about code
[Figure: Meta-Code diagram. Labels: Original Code, Meta-Code, Match code, Mismatch code, Code for Testing, Input code, Control Rule, Code Reservoir]
Reservoir Codes
Store the code from S1 in Reservoir S1
Store the code from S2 in Reservoir S2
If a code is found in both Reservoir S1 and Reservoir S2, delete these two codes
Reservoir Code (e.g. AGRCT): Code ‘AG’ in Reservoir S1, Code ‘CT’ in Reservoir S2
[Figure: codes ‘G’ and ‘C’ from S1 and S2 passing through the two reservoirs; Code ‘A’ in S1, Code ‘T’ in S2]
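The reservoir rule can be sketched in Python with multisets. This is an interpretation rather than the presenter's implementation: it assumes that a code present in both reservoirs cancels one-for-one, and that the reservoir code is written as the S1 contents, the letter R, then the S2 contents, each sorted alphabetically (the slides do not always list the letters in that order). The function names are hypothetical.

```python
from collections import Counter


def update_reservoirs(r1, r2, code1, code2):
    """Store code1 in reservoir S1 and code2 in reservoir S2, then cancel shared codes."""
    r1, r2 = Counter(r1), Counter(r2)
    r1[code1] += 1
    r2[code2] += 1
    common = r1 & r2                      # codes found in both reservoirs
    return r1 - common, r2 - common       # delete them from both sides


def reservoir_code(r1, r2):
    """Format the two reservoirs as '<S1 codes>R<S2 codes>' (sorted: an assumption)."""
    return "".join(sorted(r1.elements())) + "R" + "".join(sorted(r2.elements()))


def reservoir_length(r1):
    """Number of codes currently held in one reservoir."""
    return sum(r1.values())


# Steps taken from the examples in the slides.
print(reservoir_code(*update_reservoirs("", "", "T", "T")))      # equal codes cancel
print(reservoir_code(*update_reservoirs("A", "T", "G", "T")))
print(reservoir_code(*update_reservoirs("AG", "CT", "T", "C")))
print(reservoir_code(*update_reservoirs("AG", "CC", "C", "G")))
```

Because shared codes always cancel, both reservoirs hold the same number of codes, and an empty reservoir code `R` means the two prefixes have the same composition.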
Meta-Code Rule
Input: codes from S1 and S2, with the values of r and p
If the CM length is valid, set reservoir code = r and position = p.
Looping for creating the meta-code (e.g. AMT): copy the codes from S1 and S2, set p = p - 1, and output the meta-code.
If the reservoir code = r, then stop the looping.
CM (Lengths & Codes)
S1:        T  A    C    T    C    G      G    A      C
S2:        T  T    C    G    C    C      A    T      C
Length:    0  1    1    1    1    2      1    2      2
MetaCode:  R  ART  ART  ART  ART  AGRCT  GRC  GARTC  GARTC
GT: Length 2, AGRTT
TC: Length 2, AGRCC
CG: Length 1, ARC
Reservoir codes in S1: A A A A AG A G GA GA AG AG
Reservoir codes in S2: T T T T TC C C CT CT TT CC
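Composition Matching of S1 and S2 in prefix length
CM of Metacode
[Figure: Composition Matching (axis values 2, 1, 0, -1) plotted against Prefix length (axis marks 6, 10 and 12), with the reservoir codes ART, AGRCT, AGRTT, ARC and GARTC at the plotted points and two candidate matches marked Invalid length]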
Composition MSA
Composition matching
S1:         T A C G T C G T C G A C
S2:         T T C T G C C C G A T C
Meta-code MSA
Meta-code:  T T C TMG GMT C C CMT GMC AMG TMA C
New S2:     T T C G T C C T C G A C
S2:         T T C T G C C C G A T C
            |   |     |           |
S1:         T A C G T C G T C G A C
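How the meta-code row and the new S2 relate to S1 and S2 can be sketched in Python. The sketch assumes that the composition-matched regions are already known (here positions 4 to 5 and 8 to 11, passed as half-open index pairs), that a meta-code token is written as the S2 code, the letter M, then the S1 code, and that a position where both codes agree keeps the plain code. The function name and the `regions` parameter are hypothetical.

```python
from collections import Counter


def build_meta_code(s1, s2, regions):
    """Return (meta-code tokens, new S2) for the given composition-matched regions."""
    for start, end in regions:
        # A region is only valid if both substrings have the same composition.
        if Counter(s1[start:end]) != Counter(s2[start:end]):
            raise ValueError(f"region {(start, end)} is not a composition match")
    inside = [any(start <= i < end for start, end in regions) for i in range(len(s2))]
    meta = [f"{y}M{x}" if hit and x != y else y for x, y, hit in zip(s1, s2, inside)]
    # Inside a matched region the S1 codes replace the S2 codes.
    new_s2 = "".join(x if hit else y for x, y, hit in zip(s1, s2, inside))
    return meta, new_s2


meta, new_s2 = build_meta_code("TACGTCGTCGAC", "TTCTGCCCGATC", [(3, 5), (7, 11)])
print(" ".join(meta))
print(" ".join(new_s2))
```

After the replacement, the new S2 agrees with S1 at every position inside the matched regions, so an ordinary character-by-character comparison can be applied to it.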
Fixed Segment
Code catalogue
A = Currency / Cards
B = Stock / Structured P.
C = Unit Trusts / Bonds
D = Insurance / Finance
E = Mortgages / Loans
Time Granularities
                  Week #1              Week #2
                  1t1 1t2 1t3 1t4 1t5  2t1 2t2 2t3 2t4 2t5  3t1 3t2 …
Branch bank #1    A   C   B   C   E    B   A   A   E   B    C   A   …
Branch bank #2    B   E   B   A   A    A   C   E   D   B    E   A   …
Branch bank #3    A   C   A   A   B    C   E   E   D   B    E   E   …
Segment Length LS = 5
․Weekly behaviour
․Semi-global alignment
․Least overlap problem
․Simple segmentation
․Composition alignment
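The fixed-segment idea can be sketched in Python: each branch sequence is cut into weekly segments of length LS = 5, and two segments are treated as a composition match when they contain the same codes in any order. The function names are hypothetical, and comparing week 1 of one branch with week 2 of another is only an illustration of the comparison, not a step stated in the slides.

```python
from collections import Counter

branches = {
    "Branch bank #1": "ACBCEBAAEBCA",
    "Branch bank #2": "BEBAAACEDBEA",
    "Branch bank #3": "ACAABCEEDBEE",
}


def fixed_segments(codes, length=5):
    """Cut a code sequence into consecutive segments of the given length."""
    return [codes[i:i + length] for i in range(0, len(codes), length)]


def same_composition(a, b):
    """True if both segments contain the same codes with the same multiplicities."""
    return Counter(a) == Counter(b)


for name, codes in branches.items():
    print(name, fixed_segments(codes))

week2_bank1 = fixed_segments(branches["Branch bank #1"])[1]
week1_bank2 = fixed_segments(branches["Branch bank #2"])[0]
print(week2_bank1, week1_bank2, same_composition(week2_bank1, week1_bank2))
```

The last segment of each row is shorter than LS because the transcript shows only the first two time slots of the third week.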
Family Classifications
Fixed-Segment Composition MSA
Composition alignment
Branch bank #1             C C B C A D
Branch bank #2             A A B A D C
Branch bank #3             A B A A B B
[Figure labels: Family, Group]
PSA
Meta-Code Branch bank #1   C C B A D C
Branch bank #2             A A B A D C
Meta-Code Branch bank #3   A A B A B B
[Figure labels: Family, Group]
Further Problems
․Fixed-segment length
․Prior sequence choice
․Speed-up PSAs
․Nos. of Segments/Codes
Meta-Code Composition MSA
Conclusions
․Fixed-segment Composition (Least Overlap Problems)
․Meta-code Approach (Easier Transform Applications)
․Widespread use of MSA (Simultaneous Multiple Sequences)