sequencing a genome. approximate molecular dynamics: new algorithms with applications in protein...
Sequencing a genome
Approximate Molecular Dynamics: New Algorithms with Applications in Protein Folding
Author: Qun (Marc) Ma
Predicting the 3D native structures of proteins from the known amino acid sequence, i.e., protein folding, has become pressing in structural genomics and computational biology. Though it is plausible to use molecular dynamics (MD) simulations to study the folding of proteins, the currently available methodologies are incapable of addressing the timescale problems.
In this talk, I will describe the recent advances in the development of two new multiscale integrators that allow very large time steps (and thus ``approximate'' molecular dynamics)
Definition
• Determining the identity and order of the nucleotides in the genetic material (usually DNA, sometimes RNA) of an organism
Basic problem
• Genomes are large (typically millions or billions of base pairs)
• Current technology can only reliably ‘read’ a short stretch – typically hundreds of base pairs
Elements of a solution
• Automation – over the past decade, the amount of hand-labor in the ‘reads’ has been steadily and dramatically reduced
• Assembly of the reads into sequences is an algorithmic and computational problem
A human drama
• There are competing methods of assembly
• The competing – public and private – sequencing teams used competing assembly methods
Assembly:
• Putting sequenced fragments of DNA into their correct chromosomal positions
BAC
• Bacterial artificial chromosome: bacterial DNA spliced with a medium-sized fragment of a genome (100 to 300 kb) to be amplified in bacteria and sequenced.
Contig
• Contiguous sequence of DNA created by assembling overlapping sequenced fragments of a chromosome (whether natural or artificial, as in BACs)
Cosmid
• DNA from a bacterial virus spliced with a small fragment of a genome (45 kb or less) to be amplified and sequenced
Directed sequencing
• Successively sequencing DNA from adjacent stretches of chromosome
Draft sequence
• Sequence with lower accuracy than a finished sequence; some segments are missing or in the wrong order or orientation
EST
• Expressed sequence tag: a unique stretch of DNA within a coding region of a gene; useful for identifying full-length genes and as a landmark for mapping
Exon
• Region of a gene’s DNA that encodes a portion of its protein; exons are interspersed with noncoding introns
Genome
• The entire chromosomal genetic material of an organism
Intron
• Region of a gene’s DNA that is not translated into a protein
Kilobase (kb)
• Unit of DNA equal to 1000 bases
Locus
• Chromosomal location of a gene or other piece of DNA
Megabase (mb)
• Unit of DNA equal to 1 million bases
PCR
• Polymerase chain reaction: a technique for amplifying a piece of DNA quickly and cheaply
Physical map
• A map of the locations of identifiable markers spaced along the chromosomes; a physical map may also be a set of overlapping clones
Plasmid
• Loop of bacterial DNA that replicates independently of the chromosomes; artificial plasmids can be inserted into bacteria to amplify DNA for sequencing
Regulatory region
• A segment of DNA that controls whether a gene will be expressed and to what degree
Repetitive DNA
• Sequences of varying lengths that occur in multiple copies in the genome; they make up much of the genome
Restriction enzyme
• An enzyme that cuts DNA at specific sequences of base pairs
RFLP
• Restriction fragment length polymorphism: genetic variation in the length of DNA fragments produced by restriction enzymes; useful as markers on maps
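As a rough illustration of the two definitions above, the sketch below digests a linear sequence at every occurrence of a recognition site and reports the fragment lengths. It uses the EcoRI site GAATTC (cut after the G) as the example enzyme; the two allele sequences and the function name `digest` are made up for illustration. A single base change that destroys one site changes the fragment lengths, which is the kind of variation an RFLP marker detects.

```python
def digest(seq: str, site: str = "GAATTC", cut_offset: int = 1) -> list[int]:
    """Return fragment lengths after cutting a linear seq at every site.

    cut_offset is the position inside the site where the cut falls
    (1 for EcoRI, which cuts G^AATTC on the strand shown).
    """
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


# Hypothetical alleles: allele_b differs by one base inside the second site.
allele_a = "ATATGAATTCGGCCGGAATTCTTAA"
allele_b = "ATATGAATTCGGCCGGAGTTCTTAA"

print(digest(allele_a))  # three fragments: [5, 11, 9]
print(digest(allele_b))  # second site lost, two fragments: [5, 20]
```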
Scaffold
• A series of contigs that are in the right order but are not necessarily connected in one continuous stretch of sequence
Shotgun sequencing
• Breaking DNA into many small pieces, sequencing the pieces, and assembling the fragments
STS
• Sequence tagged site: a unique stretch of DNA whose location is known; serves as a landmark for mapping and assembly
YAC
• Yeast artificial chromosome: yeast DNA spliced with a large fragment of a genome (up to 1 mb) to be amplified in yeast cells and sequenced
Readings
• Myers, “Whole Genome DNA Sequencing,” www.cs.arizona.edu/people/gene/PAPERS/whole.IEEE.pdf
• Venter et al., “The Sequence of the Human Genome,” Science, 16 Feb 2001, Vol. 291, No. 5507, 1304 (parts 1 & 2)
• Waterston, Lander, Sulston, “On the sequencing of the human genome,” PNAS, 19 March 2002, Vol. 99, No. 6, 3712-3716
• Myers et al., “On the sequencing and assembly of the human genome,” www.pnas.org/cgi/doi/10.1073/pnas.092136699
Hierarchical sequencing
• Create a high-level physical map, using ESTs and STSs
• Shred genome into overlapping clones
• Multiply clones in BACs
• ‘Shotgun’ each clone
• Read each ‘shotgunned’ fragment
• Assemble the fragments
Physical map
Whole genome sequencing (WGS)
• Make multiple copies of the target
• Randomly ‘shotgun’ each target, discarding very big and very small pieces
• Read each fragment
• Reassemble the ‘reads’
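The first three steps above can be simulated in a few lines. This is only a sketch under simplifying assumptions that are not in the slides: break points are uniform at random, each fragment is read from one end only, reads are error-free, and strand orientation is ignored. The function name `shotgun_reads` and all parameter values are illustrative.

```python
import random


def shotgun_reads(target, n_fragments, min_len, max_len, read_len, seed=0):
    """Simulate shotgun reads from many copies of target.

    Each iteration takes a fresh copy of the target, picks two random
    break points, and keeps the fragment only if its size is acceptable.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_fragments):
        a, b = sorted(rng.randrange(len(target) + 1) for _ in range(2))
        fragment = target[a:b]
        if not (min_len <= len(fragment) <= max_len):
            continue  # discard very big and very small pieces
        reads.append(fragment[:read_len])  # read one end of the fragment
    return reads


# A made-up 1000-base target, purely for demonstration.
target = "".join(random.Random(1).choice("ACGT") for _ in range(1000))
reads = shotgun_reads(target, n_fragments=200, min_len=50, max_len=200, read_len=30)
print(len(reads), "reads kept; first read:", reads[0])
```

Reassembling these reads into the target is the hard part.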
Hierarchical v. whole-genome
The fragment assembly problem
• Aim: infer the target from the reads
• Difficulties
– Incomplete coverage. Leaves contigs separated by gaps of unknown size.
– Sequencing errors. Rate increases with length of read. Less than some ε.
– Unknown orientation. Don’t know whether to use read or its Watson-Crick complement.
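The orientation difficulty can be made concrete with a small sketch. It finds the longest exact suffix-prefix overlap between two reads, trying the second read both as given and as its reverse (Watson-Crick) complement. Exact matching is an assumption made for brevity; a real assembler has to tolerate sequencing errors. The function names and example reads are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read: str) -> str:
    """Return the Watson-Crick complement, read 5' to 3'."""
    return read.translate(COMPLEMENT)[::-1]


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def best_overlap(a: str, b: str, min_len: int = 3):
    """Try b in both orientations; return (overlap length, oriented b)."""
    candidates = [(overlap(a, x, min_len), x) for x in (b, reverse_complement(b))]
    return max(candidates, key=lambda c: c[0])


a = "GGATCCTTAG"
b = "ATTGCTAAG"  # sequenced from the opposite strand

length, oriented = best_overlap(a, b)
print(length, oriented)        # 5 CTTAGCAAT
print(a + oriented[length:])   # merged: GGATCCTTAGCAAT
```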
Scaling and computational complexity
• Increasing size of target G
– 1990 – 40 kb (one cosmid)
– 1995 – 1.8 mb (H. influenzae)
– 2001 – 3,200 mb (H. sapiens)
The repeat problem
• Repeats
– Bigger G means more repeats
– Complex organisms have more repetitive elements
– Small repeats may appear multiple times in a read
– Long repeats may be bigger than reads (no unique region)
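The last point can be seen in a toy example: a read that lies entirely inside a repeat matches the target at more than one position, so the read alone cannot say which copy it came from. A read that extends into a unique flank is unambiguous. The target, the repeat, and the helper `occurrences` below are invented for illustration.

```python
def occurrences(target: str, read: str) -> list[int]:
    """All start positions where read occurs in target (overlaps allowed)."""
    hits = []
    i = target.find(read)
    while i != -1:
        hits.append(i)
        i = target.find(read, i + 1)
    return hits


repeat = "ACGTACGGTCA"  # 11-base repeat, present twice in the target
target = "TTGCA" + repeat + "GGATC" + repeat + "CCATG"

inside_repeat = "GTACGGT"  # 7-base read wholly inside the repeat
spans_flank = "GTCAGGA"    # 7-base read reaching into the unique flank

print(occurrences(target, inside_repeat))  # [7, 23]: two candidate positions
print(occurrences(target, spans_flank))    # [12]: placement is unique
```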
Gaps
• Read length $L_R$ hasn’t changed much
• $\rho = L_R / G$ gets steadily smaller
• Gaps $\sim R e^{-\rho R}$ for $R$ reads (Waterman & Lander)
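To get a feel for the numbers, the sketch below reads the relation above as: expected gaps ≈ R·e^(−c), where R is the number of reads and c = R·L_R/G is the coverage. The 1.8 mb target is the H. influenzae figure from the scaling slide; the 500-base read length is an assumed value within the "hundreds of base pairs" range mentioned earlier, and the function names are illustrative.

```python
import math


def expected_gaps(genome_len: int, read_len: int, n_reads: int) -> float:
    """Expected number of gaps, taken as R * exp(-R * L_R / G)."""
    coverage = n_reads * read_len / genome_len
    return n_reads * math.exp(-coverage)


def reads_for_coverage(genome_len: int, read_len: int, coverage: float) -> int:
    """Number of reads needed to reach a given average coverage."""
    return math.ceil(coverage * genome_len / read_len)


G = 1_800_000  # target size in bases
L_R = 500      # assumed read length

for c in (1, 5, 10):
    R = reads_for_coverage(G, L_R, c)
    print(f"coverage {c:>2}x: {R:>6} reads, about {expected_gaps(G, L_R, R):.1f} gaps")
```

Under these assumptions the expected gap count falls from roughly 1,300 at 1x coverage to about 120 at 5x and fewer than 2 at 10x, which is why deep coverage matters.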
How deep must coverage be?
Double-barreled shotgun sequencing
• Choose longer fragments (say, $2 \times L_R$)
• Read both ends
• Such fragments probably span gaps
• This gives an approximate size of the gap
• This links contigs into scaffolds
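A minimal sketch of the gap estimate: if the two end reads of one fragment land in different contigs, the fragment's approximate length minus the parts that lie inside each contig gives the gap size. The coordinates, the 2,000-base fragment length, and the function names are hypothetical, and the sketch assumes both reads are already placed and oriented. Averaging over several spanning fragments smooths out the uncertainty in fragment length.

```python
from statistics import mean


def estimate_gap(insert_len: int, len_a: int, start_in_a: int, end_in_b: int) -> int:
    """Estimate the gap between contig A and the following contig B.

    insert_len: approximate length of the fragment read from both ends
    len_a:      length of contig A
    start_in_a: 0-based position in A where the fragment starts
    end_in_b:   position in B (exclusive) where the fragment ends
    """
    covered_in_a = len_a - start_in_a
    return insert_len - covered_in_a - end_in_b


def scaffold_gap(insert_len: int, len_a: int, mate_pairs) -> float:
    """Average the gap estimate over several fragments spanning the gap."""
    return mean(estimate_gap(insert_len, len_a, s, e) for s, e in mate_pairs)


# Three hypothetical fragments linking a 5,000-base contig A to contig B.
pairs = [(4400, 900), (4700, 1250), (4000, 450)]
print(estimate_gap(2000, 5000, 4400, 900))  # 500
print(scaffold_gap(2000, 5000, pairs))      # 500
```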
Genomic results
HGSC v Celera results
To do or not to do?
• “The idea is gathering momentum. I shiver at the thought.” – David Baltimore, 1986
• “If there is anything worth doing twice, it’s the human genome.” – David Haussler, 2000
Public or private?
• “This information is so important that it cannot be proprietary.” – C Thomas Caskey, 1987
• “If a company behaves in what scientists believe is a socially responsible manner, they can’t make a profit.” – Robert Cook-Deegan, 1987
HW for Feb 19
• Comment on these assertions (500-1000 words):
– WLS – “Our analysis indicates that the Celera paper provides neither a meaningful test of the WGS approach nor an independent sequence of the human genome.”
– Venter – “This conclusion is based on incorrect assumptions and flawed reasoning.”
• Lesk, Exercise 2.15, problem 2.3