Reconstruction of DNA sequencing by hybridization
Ji-Hong Zhang, Ling-Yun Wu and Xiang-Sun Zhang
Institute of Applied Mathematics, AMSS, CAS
Bioinformatics
Human Genome Project
Large-molecule data in biology, such as DNA and protein
Knowledge of mathematics, computer science, information science, physics, system science and management science, as well as biology
Genomics: DNA sequencing, gene prediction, sequence alignment
DNA Sequencing
…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAGACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…
DNA Sequencing (shotgun)
[Figure: the target DNA is cut many times at random; forward and reverse reads of ~500 bp each are linked at a known distance]
DNA Sequencing (SBH)
DNA array (DNA chip) with 4^3 = 64 probes
Target DNA: AAATGCG
Sequencing by Hybridization
Hybridize the target to an array containing a spot for each possible k-tuple (k-mer)
The spectrum of a sequence: the multiset of all its k-long substrings (k-tuples)
Goal: reconstruct the sequence from its spectrum
Pevzner (1989): reconstruction is polynomial
But …
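A minimal sketch of the spectrum definition above, assuming the multiset reading (repeated k-tuples are counted). The function name `spectrum` is illustrative and does not come from the slides; the example target is the one from the DNA array slide.

```python
from collections import Counter

def spectrum(sequence: str, k: int) -> Counter:
    """Return the multiset of all k-long substrings (k-tuples) of sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Target DNA from the array slide, probed with 3-tuples.
spec = spectrum("AAATGCG", 3)
print(sorted(spec.elements()))  # the five 3-tuples of the target
```

A sequence of length n contributes n - k + 1 k-tuples, so the 7-base target yields five 3-tuples out of the 64 possible probes.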
Uniqueness of Reconstruction
Different sequences can have the same spectrum, e.g. {ACT, CTA, TAC}:
ACTAC and TACTA
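The ambiguity can be checked by brute force. The sketch below enumerates every sequence of the right length and keeps those whose 3-tuple multiset equals the given spectrum; the helper names are hypothetical, and the exhaustive search is only practical for toy lengths like this one.

```python
from collections import Counter
from itertools import product

def spectrum(sequence, k):
    """Multiset of all k-tuples of sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

def reconstructions(spec, k):
    """All sequences whose k-tuple multiset equals spec (exhaustive search)."""
    n = sum(spec.values()) + k - 1  # length implied by the number of k-tuples
    candidates = ("".join(p) for p in product("ACGT", repeat=n))
    return sorted(s for s in candidates if spectrum(s, k) == spec)

result = reconstructions(Counter(["ACT", "CTA", "TAC"]), 3)
print(result)  # ['ACTAC', 'CTACT', 'TACTA']
```

Besides the two sequences on the slide, the search also finds CTACT: the three 3-tuples form a cycle, and each starting point on the cycle gives a valid reconstruction.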
Non-uniqueness Probability
Experiment Errors
Hybridization experiments are error prone
False negative error: a k-tuple appears in the target DNA but does not appear in its measured spectrum
  Includes repetition of a k-tuple (a repeated k-tuple is detected only once)
False positive error: a k-tuple does not appear in the target DNA but does appear in its measured spectrum
Sequencing by Hybridization
Target DNA ……TTTTACGC……
Spectrum
Errors: Positive (misread) / Negative (missing, repetition)
TTT TTT TTA TAC ACG CGC
Ideal case
TTT TTT TTA TAC ACG CGC TGA
With errors
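The two cases above can be compared mechanically. The sketch below assumes the chip reports only presence or absence of each probe, so the measured spectrum is treated as a set and a repeated k-tuple loses its extra copies. The function name and the result keys are illustrative, not taken from the slides.

```python
from collections import Counter

def classify_errors(target, measured, k):
    """Compare a measured spectrum with the ideal spectrum of a known target."""
    ideal = Counter(target[i:i + k] for i in range(len(target) - k + 1))
    seen = set(measured)  # assumption: the chip cannot count copies
    return {
        # k-tuples reported by the chip but absent from the target
        "false_positive": sorted(seen - set(ideal)),
        # k-tuples of the target that the chip did not report at all
        "missing": sorted(set(ideal) - seen),
        # detected k-tuples that occur more than once: number of lost copies
        "repetition": {t: c - 1 for t, c in sorted(ideal.items())
                       if c > 1 and t in seen},
    }

report = classify_errors("TTTTACGC",
                         ["TTT", "TTA", "TAC", "ACG", "CGC", "TGA"], 3)
print(report)
```

For the slide's example this reports TGA as a false positive and one lost copy of TTT as a negative error by repetition.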
SBH Reconstruction Problem
In the case of error-free SBH experiments
  A desired solution of SBH is simply a feasible solution that includes all k-tuples in the spectrum
For the general case
  There is no additional information except the spectrum and the length of the target DNA
  A feasible solution composed of a maximum-cardinality subset of the spectrum is a reasonable desired solution
SBH Reconstruction Problem
Ideal case (without repetitions and errors)
  Equivalent to finding an Eulerian path in a corresponding graph (Pevzner, 1989)
  A linear-time algorithm (Fleischner, 1990)
The general case is an NP-hard problem
  Branch and bound
  Heuristics
Extensions
  PSBH (positional SBH)
  SBH with length error
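For the ideal case, the Eulerian-path view can be sketched as follows. This assumes the usual graph construction in which each k-tuple is an edge from its (k-1)-prefix to its (k-1)-suffix, and it uses Hierholzer's algorithm as a stand-in; it is not claimed to be the specific algorithm cited on the slide. The function name `reconstruct` is illustrative.

```python
from collections import defaultdict

def reconstruct(kmers):
    """Return one sequence that uses every k-tuple exactly once, or None."""
    if not kmers:
        return None
    graph = defaultdict(list)    # (k-1)-mer -> list of successor (k-1)-mers
    balance = defaultdict(int)   # out-degree minus in-degree per node
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    starts = [n for n, b in balance.items() if b == 1]
    ends = [n for n, b in balance.items() if b == -1]
    if any(abs(b) > 1 for b in balance.values()) or len(starts) > 1 or len(ends) > 1:
        return None              # degree conditions for an Eulerian path fail
    start = starts[0] if starts else kmers[0][:-1]
    # Hierholzer's algorithm: walk until stuck, then backtrack and splice.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(kmers) + 1:
        return None              # some edges unreachable: graph not connected
    return path[0] + "".join(node[-1] for node in path[1:])

print(reconstruct(["TTT", "TTT", "TTA", "TAC", "ACG", "CGC"]))
```

When several Eulerian paths exist, as for {ACT, CTA, TAC}, the function returns only one of them, which is exactly the non-uniqueness issue raised earlier.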
Motivations
Give criteria that determine the most likely k-tuples at both ends and in the middle of all possible reconstructions of the target DNA
  These criteria greatly reduce ambiguities in the reconstruction of the DNA
Transform the negative errors into positive errors
  This enables us to handle both types of errors easily
Separate the repetitions from both types of errors
Methods
Estimate the number of k-tuples that do not occur in a solution
  Adjacency matrix (connection matrix)
  Gives a lower bound on the number of k-tuples that do not occur, valid for all solutions from k-tuple i to k-tuple j
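The slides do not define the matrix entries, so the following is only one plausible reading: entry (i, j) holds the smallest number of k-tuples that must be missing if k-tuple j directly follows k-tuple i in a reconstruction. The names `missing_between` and `connection_matrix` are hypothetical, and the lower bound over whole paths is not reproduced here.

```python
def missing_between(a, b):
    """Fewest k-tuples that must be absent for b to follow a in a sequence."""
    k = len(a)
    for shift in range(1, k + 1):
        # b starts `shift` positions after a; the overlapping parts must agree
        if a[shift:] == b[:k - shift]:
            return shift - 1     # shift 1 means adjacent, nothing missing
    return k - 1                 # not reached: shift == k always matches

def connection_matrix(kmers):
    """Matrix of pairwise gaps between the k-tuples of a spectrum."""
    return [[missing_between(a, b) for b in kmers] for a in kmers]

kmers = ["TTT", "TTA", "TAC", "ACG", "CGC"]
for kmer, row in zip(kmers, connection_matrix(kmers)):
    print(kmer, row)
```

A zero entry corresponds to an ordinary overlap of k-1 bases; larger entries mark joins that are only possible if intermediate k-tuples were lost to negative errors.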
Methods
Determine the most likely k-tuples at both ends
  Reconstruct from the most likely end pairs to get an upper bound for the SBH problem
  Purge the end pairs that cannot yield a better solution than the current upper bound
Methods
Transform the negative errors into positive errors
  Artificial k-tuples
    Fill in all the possible gaps due to false negative errors
  Negative error level
    The maximal number of allowed consecutively missing k-tuples
    Reduces the number of artificial k-tuples
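A rough sketch of the gap-filling idea, under stated assumptions: whenever two measured k-tuples can be joined with at most `level` k-tuples missing between them, the k-tuples spanning the join are added as artificial candidates. The function name, the parameter `level` (standing in for the negative error level) and the restriction to gaps shorter than k are illustrative choices, not details given on the slides.

```python
def artificial_ktuples(spectrum, level):
    """Candidate k-tuples that would fill gaps of up to `level` missing k-tuples."""
    spectrum = set(spectrum)
    k = len(next(iter(spectrum)))
    artificial = set()
    for a in spectrum:
        for b in spectrum:
            # shift s joins a and b with s - 1 k-tuples missing in between
            for shift in range(2, min(level, k - 1) + 2):
                if a[shift:] == b[:k - shift]:
                    merged = a + b[k - shift:]
                    for i in range(1, shift):
                        artificial.add(merged[i:i + k])
    return artificial - spectrum  # keep only k-tuples not already measured

# Spectrum of TTTTACGC with TAC lost to a false negative error.
print(sorted(artificial_ktuples({"TTT", "TTA", "ACG", "CGC"}, level=1)))
```

With level 1 the lost k-tuple TAC is recovered, but GCG is added as well. This illustrates the trade-off on the slide: the artificial k-tuples turn negative errors into positive ones, and a small error level keeps their number down.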
Computational Experiments
109 DNA sequences from GenBank
Simulate the SBH experiments
Error models
  Random (probabilistic model)
  Systematic (one-base-mismatch model)
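A simulated SBH experiment along these lines might look like the sketch below. The error parameters `p_neg` and `n_pos`, the function names and the sampling scheme are all assumptions made for illustration; the slides do not state the error rates or the exact procedure used.

```python
import random
from itertools import product

def neighbours(kmer):
    """All k-tuples that differ from kmer in exactly one base."""
    return {kmer[:i] + base + kmer[i + 1:]
            for i in range(len(kmer)) for base in "ACGT" if base != kmer[i]}

def simulate_sbh(target, k, p_neg, n_pos, rng, one_mismatch=False):
    """Return a measured spectrum (set of k-tuples) with simulated errors."""
    ideal = {target[i:i + k] for i in range(len(target) - k + 1)}
    # False negatives: each true k-tuple is dropped with probability p_neg.
    kept = {t for t in sorted(ideal) if rng.random() >= p_neg}
    # False positives: drawn at random, or only from one-base mismatches.
    if one_mismatch:
        pool = set().union(*(neighbours(t) for t in ideal)) - ideal
    else:
        pool = {"".join(p) for p in product("ACGT", repeat=k)} - ideal
    false_pos = set(rng.sample(sorted(pool), min(n_pos, len(pool))))
    return kept | false_pos

rng = random.Random(0)  # fixed seed so the simulation is repeatable
print(sorted(simulate_sbh("TTTTACGC", 3, p_neg=0.2, n_pos=2, rng=rng)))
```

Passing `one_mismatch=True` restricts false positives to probes that differ from a true k-tuple in a single base, which is one way to read the systematic error model.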
Conclusions
Ideal case (without repetitions and errors) can be solved in polynomial time (Pevzner, 1989)
The general case is an NP-hard problem
Design efficient algorithms
Ji-Hong Zhang, Ling-Yun Wu and Xiang-Sun Zhang. A new approach to the reconstruction of DNA sequencing by hybridization. Bioinformatics, vol 19(1), pages 14-21, 2003.
Xiang-Sun Zhang, Ji-Hong Zhang and Ling-Yun Wu. Combinatorial optimization problems in the positional DNA sequencing by hybridization and its algorithms. System Sciences and Mathematics, vol 3, 2002. (in Chinese)
Ling-Yun Wu, Ji-Hong Zhang and Xiang-Sun Zhang. Application of neural networks in the reconstruction of DNA sequencing by hybridization. In Proceedings of the 4th ISORA, 2002.