GENERATION OF DNA CODES
![Page 1: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/1.jpg)
Vladimir V. Ufimtsev
Adviser: Dr. V. Rykov
![Page 2: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/2.jpg)
Historical Background

1948. A Mathematical Theory of Communication. C. E. Shannon.
Main result: Entropy function, the average value of information obtained from a channel.

1950. Error Detecting and Error Correcting Codes. R. W. Hamming.
Main result: Matrices that can be used to encode messages and provide more reliable transmission across a channel.

1953. A Structure for Deoxyribose Nucleic Acid. J. D. Watson, F. H. C. Crick, M. H. F. Wilkins, R. E. Franklin.
Main result: Structure found for the building block of life.

1959. There's Plenty of Room at the Bottom. R. P. Feynman.
Main result: Anticipated science at the nanoscale ($10^{-9}$ meters).
![Page 3: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/3.jpg)
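Let $A_q = \{0, 1, \ldots, q-1\}$.

Let $F_q^n$ denote a set consisting of all vectors (codewords) of length $n$ built over $A_q$, i.e. $\mathbf{x} \in F_q^n$ means $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ with $x_i \in A_q$.

Let $d: F_q^n \times F_q^n \to \{0, 1, 2, \ldots\}$ be such that for all $\mathbf{x}, \mathbf{y}, \mathbf{z} \in F_q^n$:

1) $d(\mathbf{x}, \mathbf{y}) = 0 \iff \mathbf{x} = \mathbf{y}$
2) $d(\mathbf{x}, \mathbf{y}) = d(\mathbf{y}, \mathbf{x})$
3) $d(\mathbf{x}, \mathbf{y}) \le d(\mathbf{x}, \mathbf{z}) + d(\mathbf{z}, \mathbf{y})$

Let $C_q(n, M, d) \subseteq F_q^n$ be such that:

$$\forall\, \mathbf{x}, \mathbf{y} \in C_q(n, M, d),\ \mathbf{x} \ne \mathbf{y}:\quad d(\mathbf{x}, \mathbf{y}) \ge d, \qquad |C_q(n, M, d)| = M$$

$C_q(n, M, d)$ is referred to as a code of length $n$, size $M$, and minimum distance $d$.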
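The three parameters of a concrete code can be read off directly from its codewords. The sketch below is only illustrative: the helper names `hamming` and `minimum_distance` are hypothetical, and the Hamming distance is used as a stand-in metric (an assumption; the slides allow any function satisfying properties 1 to 3).

```python
from itertools import combinations

def hamming(x, y):
    """Stand-in metric (assumption): number of positions where x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def minimum_distance(code, metric):
    """Smallest metric value over all pairs of distinct codewords."""
    return min(metric(x, y) for x, y in combinations(code, 2))

# A small binary code (q = 2) used only as an example.
code = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
n = len(code[0])                      # length
M = len(code)                         # size
d = minimum_distance(code, hamming)   # minimum distance
print(n, M, d)
```

Passing the metric as an argument keeps the helper usable with any of the DNA metrics discussed later.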
![Page 4: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/4.jpg)
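A sphere in $F_q^n$ centered at $\mathbf{x}$ having radius $d$:

$$Sph_q(\mathbf{x}, n, d) = \{\mathbf{y} : \mathbf{y} \in F_q^n,\ d(\mathbf{x}, \mathbf{y}) = d\}$$

Volume of the sphere around $\mathbf{x}$, of radius $d$:

$$V_q(\mathbf{x}, n, d) = \sum_{i=0}^{d} |Sph_q(\mathbf{x}, n, i)| = |\{\mathbf{y} : \mathbf{y} \in F_q^n,\ d(\mathbf{x}, \mathbf{y}) \le d\}|$$

A space is HOMOGENEOUS when the volume of a sphere does not depend on where it is centered, i.e.

$$(\forall\, \mathbf{x}, \mathbf{y} \in F_q^n)\,(\forall\, d,\ 0 \le d \le n)\,\big(V_q(\mathbf{x}, n, d) = V_q(\mathbf{y}, n, d)\big)$$

A space is NON-HOMOGENEOUS when the volume of a sphere does depend on where it is centered.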
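For small n and q, homogeneity can be checked by brute force: enumerate every center, count the words within each radius, and see whether the counts agree. The following is a minimal sketch under stated assumptions: the function names are hypothetical, and the Hamming distance is used as a familiar example of a metric for which the space is homogeneous.

```python
from itertools import product

def hamming(x, y):
    """Example metric (assumption): number of positions where x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def sphere_volume(center, n, radius, q, metric):
    """Count the words of F_q^n whose distance from `center` is at most `radius`."""
    return sum(1 for y in product(range(q), repeat=n) if metric(center, y) <= radius)

def is_homogeneous(n, q, metric):
    """True when, for every radius 0..n, all centers give the same volume."""
    space = list(product(range(q), repeat=n))
    return all(
        len({sphere_volume(x, n, r, q, metric) for x in space}) == 1
        for r in range(n + 1)
    )

print(sphere_volume((0, 0, 0), 3, 1, 2, hamming))
print(is_homogeneous(3, 2, hamming))
```

The enumeration costs on the order of $q^{2n}$ metric evaluations per radius, so it is only practical for very short words.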
![Page 5: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/5.jpg)
For any code there are 3 conflicting parameters:
Length: n
Size: M
Minimum distance: d
The aim of coding theory is:
Given any 2 parameters, find the optimal value for the 3rd. We need small n for fast transmission, large M so that as much information as possible can be encoded, and large d so that we can detect and correct many errors.
![Page 6: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/6.jpg)
Exact formulas for sphere volumes and code sizes are often extremely difficult to obtain. In most cases only upper and lower bounds can be obtained for these parameters.
We will be working in a NON-HOMOGENEOUS space, which makes obtaining exact formulas for sphere volumes and code sizes VERY HARD.
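Hamming Upper Bound on Code Size in $F_q^n$ with any metric:

$$|C_q(n, M, d)| \cdot V_q^{min}\!\left(n, \left\lfloor \frac{d-1}{2} \right\rfloor\right) \le q^n$$

$$|C_q(n, M, d)| \le \frac{q^n}{V_q^{min}\!\left(n, \left\lfloor \frac{d-1}{2} \right\rfloor\right)}$$

Varshamov-Gilbert Lower Bound on Code Size in $F_q^n$ with any metric:

$$q^n \le |C_q^{max}(n, M, d)| \cdot V_q^{max}(n, d-1)$$

$$\frac{q^n}{V_q^{max}(n, d-1)} \le |C_q^{max}(n, M, d)|$$

Here $V_q^{min}$ and $V_q^{max}$ are the smallest and largest sphere volumes over all centers in $F_q^n$.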
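Both bounds reduce to a division once the extreme sphere volumes are known. The sketch below computes them by exhaustive enumeration for a tiny space. It is an illustration under assumptions: the function names are hypothetical, the Hamming distance stands in for the metric, and the results are rounded to integers (floor for the upper bound, ceiling for the lower bound) because a code size is a whole number.

```python
from itertools import product

def hamming(x, y):
    """Stand-in metric (assumption): number of positions where x and y differ."""
    return sum(a != b for a, b in zip(x, y))

def sphere_volumes(n, q, radius, metric):
    """Volume of the radius-`radius` sphere around every center of F_q^n."""
    space = list(product(range(q), repeat=n))
    return [sum(1 for y in space if metric(x, y) <= radius) for x in space]

def hamming_upper_bound(n, q, d, metric):
    """floor(q^n / V_min(n, floor((d - 1) / 2)))."""
    v_min = min(sphere_volumes(n, q, (d - 1) // 2, metric))
    return q**n // v_min

def varshamov_gilbert_lower_bound(n, q, d, metric):
    """ceil(q^n / V_max(n, d - 1))."""
    v_max = max(sphere_volumes(n, q, d - 1, metric))
    return -(-(q**n) // v_max)

print(varshamov_gilbert_lower_bound(5, 2, 3, hamming),
      hamming_upper_bound(5, 2, 3, hamming))
```

In a homogeneous space the minimum and maximum volumes coincide. In a non-homogeneous space they can differ widely, which is why the average volume is considered next.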
![Page 7: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/7.jpg)
CLIQUES:
Let G be a simple graph on $q^n$ vertices and $e$ edges. G contains an M-clique if:

$$e > \left(1 - \frac{1}{M-1}\right)\frac{q^{2n}}{2}$$
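The clique condition is the edge threshold from Turán's theorem, and it is easy to evaluate exactly. A minimal sketch, assuming hypothetical helper names and using exact rational arithmetic to avoid rounding at the boundary:

```python
from fractions import Fraction

def turan_threshold(num_vertices, M):
    """Edge count (1 - 1/(M-1)) * N^2 / 2 that must be exceeded, for M >= 2."""
    return (1 - Fraction(1, M - 1)) * Fraction(num_vertices**2, 2)

def guarantees_clique(num_vertices, num_edges, M):
    """True when the edge count alone forces an M-clique."""
    return num_edges > turan_threshold(num_vertices, M)

# Four vertices: a 4-cycle has 4 edges and no triangle, but any 5 edges force one.
print(turan_threshold(4, 3), guarantees_clique(4, 4, 3), guarantees_clique(4, 5, 3))
```

The condition is sufficient, not necessary: a graph with fewer edges may still contain an M-clique.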
![Page 8: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/8.jpg)
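Take G to be the graph whose vertices are the vectors of $F_q^n$, with $\mathbf{x}$ and $\mathbf{y}$ joined by an edge when $d(\mathbf{x}, \mathbf{y}) \ge d$. An M-clique in G is then a code of size M with minimum distance $d$. Each vertex $\mathbf{x}$ has $q^n - V_q(\mathbf{x}, n, d-1)$ neighbors, so the number of edges is:

$$e = \frac{1}{2}\sum_{\mathbf{x} \in F_q^n}\big(q^n - V_q(\mathbf{x}, n, d-1)\big) = \frac{1}{2}\left(q^{2n} - \sum_{\mathbf{x} \in F_q^n} V_q(\mathbf{x}, n, d-1)\right) = \frac{q^n}{2}\left(q^n - V_q^{avg}(n, d-1)\right)$$

where $V_q^{avg}(n, d-1) = \frac{1}{q^n}\sum_{\mathbf{x} \in F_q^n} V_q(\mathbf{x}, n, d-1)$.

If:

$$\frac{q^{2n}}{2}\left(1 - \frac{V_q^{avg}(n, d-1)}{q^n}\right) > \left(1 - \frac{1}{M-1}\right)\frac{q^{2n}}{2}$$

that is, if

$$M - 1 < \frac{q^n}{V_q^{avg}(n, d-1)}$$

Then there exists a code of size M:

$$M \le |C_q^{max}(n, d)|$$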
![Page 9: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/9.jpg)
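Let

$$M = \left\lceil \frac{q^n}{V_q^{avg}(n, d-1)} \right\rceil$$

Then:

$$\frac{q^n}{V_q^{avg}(n, d-1)} \le M < \frac{q^n}{V_q^{avg}(n, d-1)} + 1$$

The right-hand inequality gives $M - 1 < q^n / V_q^{avg}(n, d-1)$.

Hence there exists a code of size M and so:

$$\frac{q^n}{V_q^{avg}(n, d-1)} \le |C_q^{max}(n, d)|$$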
![Page 10: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/10.jpg)
The rules of base pairing (nucleotide pairing):
• A - T: adenine (A) always pairs with thymine (T)
• C - G: cytosine (C) always pairs with guanine (G)
![Page 11: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/11.jpg)
• Each base has a bonding surface.
• Bonding surface of A is complementary to that of T (2 bonds).
• Bonding surface of G is complementary to that of C (3 bonds).
• Hybridization is a process that joins two complementary, opposite-polarity single strands into a double strand through hydrogen bonds.
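Because hybridizing strands have opposite polarity, the perfect partner of a strand is its complement read in reverse. A small sketch of that rule follows; the names `COMPLEMENT`, `reverse_complement`, and `hybridizes_perfectly` are hypothetical, and strands are assumed to be written as strings over A, C, G, T in the same direction.

```python
# Base pairing rules: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand):
    """Complement every base and reverse the order (opposite polarity)."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

def hybridizes_perfectly(strand_1, strand_2):
    """True when strand_2 is the exact reverse complement of strand_1."""
    return strand_2 == reverse_complement(strand_1)

print(reverse_complement("AACG"))
print(hybridizes_perfectly("AACG", "CGTT"))
```

Only exact pairing is modeled here. Partial or shifted hybridization is what the similarity measures in the later slides are meant to capture.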
![Page 12: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/12.jpg)
Orientation of single DNA strands is important for hybridization.
![Page 13: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/13.jpg)
Direct
Shifted
Folded
Loop
![Page 14: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/14.jpg)
Interest in DNA computing was sparked in 1994 by Len Adleman.
Adleman showed how DNA molecules can be used to solve a mathematical problem (the Hamiltonian path problem).
DNA computing relies on the fact that DNA strands can be represented as sequences of bases (4-ary sequences) and on the property of hybridization.
In hybridization, errors can occur. Thus, error-correcting codes are required for efficient synthesis of DNA strands to be used in computing.
![Page 15: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/15.jpg)
Sequence $\mathbf{z} = (z_1, z_2, \ldots, z_k)$ is a subsequence of $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ if and only if there exists a strictly increasing sequence of indices $(i_1, i_2, \ldots, i_k)$ such that:

$$z_j = x_{i_j} \quad \forall j$$

$LCS(\mathbf{x}, \mathbf{y})$ is defined to be the set of longest common subsequences of $\mathbf{x}$ and $\mathbf{y}$.
$L(\mathbf{x}, \mathbf{y})$ is defined to be the length of the longest common subsequence of $\mathbf{x}$ and $\mathbf{y}$.
![Page 16: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/16.jpg)
• X = ( A T C T G A T )
Z = ( T C G T ) is a subsequence of X
• X = ( A T C T G A T )
Y = ( T G C A T A )
( T C A T ) is in LCS(X, Y)
L(X, Y) = 4
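The length of a longest common subsequence can be computed with the standard dynamic-programming recurrence. The sketch below is one common way to do it, not necessarily the method used for the slides; the function names are hypothetical. It reproduces the two examples above.

```python
def is_subsequence(z, x):
    """True when z can be obtained from x by deleting symbols."""
    remaining = iter(x)
    return all(symbol in remaining for symbol in z)

def lcs_length(x, y):
    """Length of a longest common subsequence of x and y (row-by-row DP)."""
    prev = [0] * (len(y) + 1)          # row for the empty prefix of x
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)             # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))   # drop a symbol from x or y
        prev = curr
    return prev[-1]

X, Y = "ATCTGAT", "TGCATA"
print(is_subsequence("TCGT", X), lcs_length(X, Y))
```

Keeping only two rows of the table brings memory down to the length of `y`, at the cost of not being able to recover the subsequence itself.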
![Page 17: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/17.jpg)
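Original Insertion-Deletion metric (Levenshtein 1966): for $\mathbf{x}, \mathbf{y} \in F_q^n$,

$$d(\mathbf{x}, \mathbf{y}) = \big(n - L(\mathbf{x}, \mathbf{y})\big) + \big(n - L(\mathbf{x}, \mathbf{y})\big) = 2\big(n - L(\mathbf{x}, \mathbf{y})\big)$$

This metric results from the number of deletions and insertions that need to be made to obtain $\mathbf{y}$ from $\mathbf{x}$.
For vectors that have the same length:
the number of deletions that will be made is: $n - L(\mathbf{x}, \mathbf{y})$
likewise, the number of insertions that will be made is: $n - L(\mathbf{x}, \mathbf{y})$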
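A brute-force experiment on a very small space shows why this metric is non-homogeneous. The sketch below is illustrative only: the function names are hypothetical, and the parameters n = 3, q = 2, d = 4 are chosen purely to keep the enumeration tiny. It computes every sphere volume of radius d − 1 and the ceiling of $q^n$ divided by the average volume.

```python
from itertools import product

def lcs_length(x, y):
    """Length L(x, y) of a longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def indel_distance(x, y):
    """d(x, y) = 2 * (n - L(x, y)) for two words of the same length n."""
    if len(x) != len(y):
        raise ValueError("both words must have length n")
    return 2 * (len(x) - lcs_length(x, y))

def sphere_volumes(n, q, radius):
    """Map each center x in F_q^n to the number of words within `radius`."""
    space = list(product(range(q), repeat=n))
    return {x: sum(1 for y in space if indel_distance(x, y) <= radius) for x in space}

n, q, d = 3, 2, 4                      # toy parameters (assumption)
vols = sphere_volumes(n, q, d - 1)
total = sum(vols.values())             # equals q^n * V_avg
avg_bound = -(-(q**n) ** 2 // total)   # ceil(q^n / V_avg), in exact integer arithmetic
print(vols[(0, 0, 0)], vols[(0, 1, 0)])
print(min(vols.values()), max(vols.values()), avg_bound)
```

For these toy parameters the sphere around 000 holds 4 words while the sphere around 010 holds 7, so the volume depends on the center.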
![Page 18: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/18.jpg)
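A common subsequence $\mathbf{z} = (z_1, z_2, \ldots, z_m)$, $2 \le m \le n$, is called a common stacked pair subsequence of length $m$ between $\mathbf{x}$ and $\mathbf{y}$ if any two elements $z_i, z_{i+1}$, $i = 1, 2, \ldots, m-1$, are consecutive in $\mathbf{x}$ and consecutive in $\mathbf{y}$, or, if they are non-consecutive in $\mathbf{x}$ or non-consecutive in $\mathbf{y}$, then $z_{i+1}$ and $z_{i+2}$ are consecutive in $\mathbf{x}$ and $\mathbf{y}$.

Let $S(\mathbf{x}, \mathbf{y})$, $0 \le S(\mathbf{x}, \mathbf{y}) \le n$, denote the length of the longest sequence occurring as a common stacked pair subsequence $\mathbf{z}$ between sequences $\mathbf{x}$ and $\mathbf{y}$. The number $S(\mathbf{x}, \mathbf{y})$ is called a similarity of blocks between $\mathbf{x}$ and $\mathbf{y}$. The metric is defined to be

$$d(\mathbf{x}, \mathbf{y}) = n - S(\mathbf{x}, \mathbf{y})$$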
![Page 19: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/19.jpg)
The upper bound for the average sphere volume in this metric will be:
[Formula on slide: upper bound on $V_q^{avg}(n, d-1)$ in terms of $n$, $d$, and $q$]
The Varshamov-Gilbert bound becomes:
[Formula on slide: the resulting lower bound on $|C_q^{max}(n, d)|$]
![Page 20: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/20.jpg)
A C G T
A 1.00 1.44 1.28 0.88
C 1.45 1.84 2.17 1.28
G 1.30 2.24 1.84 1.44
T 0.58 1.30 1.45 1.00
Thermodynamic weight of virtual stacked pairs.
• Statistical estimation of the sphere volume can be used.
![Page 21: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/21.jpg)
• There are many possibilities for metrics on the space of DNA sequences.
• All discussed metrics are non-homogeneous i.e. the sizes of the spheres in the metric spaces depend on the location of their centers.
• A universal method that will allow us to calculate lower bounds for optimal code sizes was given.
![Page 22: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/22.jpg)
Length (n) Min. size
15 8
16 15
17 28
18 53
19 107
20 223
21 479
22 1055
23 2386
24 5524
25 13068
26 31545
27 77600
28 1943016
29 494758
30 1279652
Minimum distance (d) = 6
![Page 23: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/23.jpg)
Length (n) Min. size
15 2
16 3
17 5
18 8
19 13
20 24
21 46
22 90
23 183
24 381
25 815
26 1783
27 3988
28 9102
29 21174
30 50155
Minimum distance (d) = 7
![Page 24: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/24.jpg)
Length (n) Min. size
20 4
21 7
22 12
23 21
24 39
25 75
26 149
27 304
28 635
29 1354
30 2946
Minimum distance (d) = 8
![Page 25: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/25.jpg)
Length (n) Min. size
20 1
21 2
22 2
23 4
24 6
25 10
26 18
27 33
28 62
29 121
30 243
Minimum distance (d) = 9
![Page 26: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/26.jpg)
Length (n) Min. size
25 2
26 3
27 5
28 8
29 15
30 27
Minimum distance (d) = 10
![Page 27: Vladimir V. Ufimtsev](https://reader033.vdocuments.us/reader033/viewer/2022061511/568158f9550346895dc6349c/html5/thumbnails/27.jpg)
Length LCS Min dist. Size V-G bound
10 8 2 4365
14 12 2 580715
12 8 4 482 25
14 10 4 2683 151
16 12 4 1042
18 14 4 7989
20 16 4 66413
22 18 4 588872
24 20 4 5504930
14 8 6 66 1
16 10 6 204 3
18 12 6 767 13
20 14 6 2843 65
22 16 6 364
24 18 6 2279
16 8 8 28 1
18 10 8 50 1
20 12 8 122 1
22 14 8 345 2
24 16 8 1084 7
22 12 10 45 1
24 14 10 86 1