introduction to molecular biology bioinformatics: issues and...
TRANSCRIPT
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 1 -
Bioinformatics:Issues and Algorithms
CSE 308-408 • Fall 2007 • Lecture 2
Introduction toMolecular Biology
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 2 -
Important points to remember
CSE 308-408 is NOT a programming course:• Previous programming experience not required to do well.• Being a great programmer does not imply a high grade.• Best grades go to students who work to learn bioinformatics.
We will study:• Problems from bioinformatics.• Algorithms used to solve them.• Perl programming (language of choice for bioinformatics).
Requirements:• Homework and programming assignments.• Final project or paper due at end of semester.• CSE 408 students must also “scribe” one lecture.
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 3 -
Course grading
Scribe duty (CSE 408)* = n/a 10% grade
* Note that CSE 308 and CSE 408 point totals will be different and each will be curved separately.
Homework assignments = 25% grade 20% grade
Programming assignments = 25% grade 20% grade
Final project or paper = 50% grade 50% grade
CSE 308 CSE 408
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 4 -
The Central Dogma of Molecular Biology
1. DNA copies its information in process involving many enzymes (replication).2. DNA codes for production of mRNA during transcription.3. mRNA migrates from nucleus to cytoplasm.4. mRNA carries coded information to ribosomes which "read" it and use it for protein synthesis (translation).
http://allserv.rug.ac.be/~avierstr/principles/centraldogma.html
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 5 -
The Central Paradigm of Bioinformatics
By developing techniques for analyzing sequence data and the structures that result, we can attempt to understand the genetic nature of diseases.
http://cmgm.stanford.edu/biochem218/
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 6 -
“Junk” DNA
http://www.accessexcellence.org/AB/GG/genes.htmlhttp://www.psrast.org/junkdna.htm
At this point in time, <10% of theDNA in the human genome can be associated with genes.The remainder is known as junk DNA because it has no apparent function.
Recall that genes are contiguous stretches along a chromosome.
However, recent studies are showing that non-coding DNA may play an important role in regulating gene expression (enhancing or suppressing expression of proximal genes).
It's also used in forensic analysis as mutations are more likely in non-coding DNA regions than within genes (why?).
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 7 -
Genetic inheritence
http://www.accessexcellence.org/AB/GG/hapDIP.html
4. Zygote splits and reproduces through mitosis to yield multi-cellular diploid organism.
1. Cells in mother and father both contain paired sets of chromosomes (diploid).
2. Through meiosis, gametes (sex cells) contain only one chromosome from each pair (haploid).
3. Fertilized egg cell (zygote) receives one chromosome from mother, one from father.
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 8 -
Crossing over (recombination)
During meiosis, homologous chromosomes may cross over (recombine) forming chromosomes that mix genes from each parent.
http://www.accessexcellence.org/AB/GG/comeiosis.html
The two chromosomes that form a pair are called homologous.
Note that liklihood of recombination is function of distance between two genes. This observation is used in creating genetic linkage maps.
Here we see recombination of gene c/C which appears in two forms (alleles). Genes ab (AB) are unlikely to recombine.
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 9 -
Genomes
Complete set of chromosomes that determines an organism is known as a genome.
Note that each cell in an organism contains its entire genome!
Sizes ofsome genomes
http://www.cbs.dtu.dk/databases/DOGS/ http://www.nsrl.ttu.edu/tmot1/mus_musc.htm http://www.oardc.ohio-state.edu/seedid/single.asp?strID=324
Mus musculus
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 10 -
We're more similar than you might think
http://www.ornl.gov/sci/techresources/Human_Genome/graphics/slides/ttmousehuman.shtmlhttp://www.news.cornell.edu/releases/Dec03/chimp.life.hrs.html
(The DNA of chimpanzees and humans is ~99% similar.)
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 11 -
Genetic linkage map(107 – 108 base pairs)
Studying a genome
Most genomes are enormous (e.g., 1010 base pairs in case of human). Current sequencing technology, on the other hand, only allows biologists to determine ~103 base pairs at a time.
This disparatey leads to some of the most interesting problems in computational biology.
Physical map(105 – 106 base pairs)
Sequencing(103 – 104 base pairs)
ACTAGCTGATCGATTTAGCAGCAG...
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 12 -
Studying a genome
http://www.ornl.gov/sci/techresources/Human_Genome/publicat/primer/primer.pdf
These results are compiledto provide sequence acrossa chromosome.Yeast artificial chromosome (YAC) is designed to “fool” yeast replication mechanism.
Cosmids and plasmids are vectors that can be cloned in bacteria.
Cloned DNA molecules are made progressively smaller and fragments subcloned to obtain pieces small enoughto sequence directly.
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 13 -
Cutting DNA using restriction enzymes
The separated pieces have single-stranded sticky ends, which allow complementary pieces to combine.
http://www.accessexcellence.org/AB/GG/restriction.html
Note that GAATTC → CTTAAG → GAATTC (i.e., palindrome).
A restriction enzyme surrounds DNA molecule at specific point, called restriction site (sequence GAATTC in this case).
It cuts one strand of DNA helix at one point and second strand at a different, complementary point (between G and A).
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 14 -
Breaking DNA
http://occawlonline.pearsoned.com/bookbind/pubbooks/bc_mcampbell_genomics_1/medialib/method/shotgun.html
DNA can also be broken in random places through mechanical means (e.g., vibration). This is typically the first step in shotgun sequencing.
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 15 -
Copying DNA
Most analytic procedures in the lab require a quantity of the DNA under study. The process of copying DNA is known as amplification.
As we have seen, one possible approach is to use nature: insert the DNA of interest into the genome of a host (or vector) and let the organism multiply itself. This is called recombinant DNA.
http://www.accessexcellence.org/AB/GG/plasmid.html
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 16 -
Polymerase Chain Reaction
http://www.accessexcellence.org/AB/GG/polymerase.htmlhttp://www.iupui.edu/~wellsctr/MMIA/htm/animations.htm
PCR rapidly amplifies a single DNA molecule into billions of molecules.
Another way to amplify DNA is polymerase chain reaction (PCR).
PCR alternates two phases: separate DNA into single strands using heat; convert into double strands using primer and polymerase reaction.
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 17 -
Polymerase Chain Reaction
http://www.dnalc.org/ddnalc/resources/pcr.html
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 18 -
Reading DNA
http://www.apelex.fr/anglais/applications/sommaire2/sanger.htmhttp://www.iupui.edu/~wellsctr/MMIA/htm/animations.htm
In general, DNA molecules with similar lengths will migrate same distance.
This is known as sequencing.
First "cut" DNA at each base: A, C, G, T. Then run gel and read off sequence: TCGCGA ...
Gel electrophoresis is process of separating a mixture of molecules in a gel media by application of an electric field.
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 19 -
Gel electrophoresis
http://www.dnalc.org/ddnalc/resources/electrophoresis.html
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 20 -
Reading DNA
ATCGTGTCGATAATCGTGTCGAA
A
Original sequence: ATCGTGTCGATAGCGCT
G ATCGTGTCGATAGATCGTGTCGATCGTGATCG
ATCGTGTCGATAGCG
T ATCGTGTCGATATCGTGTCGATAGCGCTATCGTGTATCGTAT
ATCGTGTCGATAGCGCATCGTGTCGATAGCATCGTGTCATC
C
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 21 -
Sanger sequencing
http://www.dnalc.org/ddnalc/resources/sangerseq.html
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 22 -
DNA sequencing
http://www.bii.a-star.edu.sg/docs/sbg/notes/n1/hu-genome%20lecture2.pdf
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 23 -
Sequence assembly
fragments
fragmentassembly
original
targetcontigcontig
gap
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 24 -
Sequence assembly
Simple model of DNA assembly is Shortest Supersequence Problem: given a set of sequences, find shortest sequence S such that each of original sequences appears as subsequence of S.
TTACCGTGC
ACCGT
CGTGCTTAC
--ACCGT------CGTGCTTAC-----3
12
Look for overlap between prefix ofone sequence and suffix of another:
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 25 -
Inferring gene functionality
• DNA microarrays allow biologists to infer gene function whenthere is insufficent evidencebased on similarity alone.
http://www.bioalgorithms.info/presentations/Ch10_Clustering.ppt
• For 40% of sequenced genes, functionality cannot be ascertained by such techniques.
• Researchers want to know functions of new genes.
• Simply comparing new gene sequences to known DNAoften does not reveal actual function of gene.
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 26 -
DNA microarray analysis
• More mRNA usually indicates more gene activity.
http://www.bioalgorithms.info/presentations/Ch10_Clustering.ppt
• DNA microarrays measure the activity (expression level) of the gene under varying conditions/time points.
• Expression level is estimated by measuring the amount of mRNA for that particular gene.
• A gene is active if it is being transcribed.
• Measurements are relative, not absolute!
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 27 -
DNA microarray experiments
• Scan microarray multiple times for different color phosphors.
http://www.bioalgorithms.info/presentations/Ch10_Clustering.ppt
• Analyze mRNA produced from cells in tissue with environmental conditions you are testing.
• Produce cDNA from mRNA (DNA is more stable).• Attach phosphor to cDNA to see when a particular gene is
expressed.
• Different color phosphors are available to compare many samples at once.
• Hybridize cDNA over the microarray.• Scan the microarray with a phosphor-illuminating laser.
• Illumination reveals transcribed genes.
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 28 -
DNA microarray experiments
Phosphors can be added here instead ...... then instead of
staining, laser illumination is used
http://www.bioalgorithms.info/presentations/Ch10_Clustering.ppt
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 29 -
DNA microarrays
http://www.bio.davidson.edu/Courses/genomics/chip/chip.html
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 30 -
Using DNA microarrays
Each box represents one gene’s expression over time
• Track sample over a period of time to see gene expression over time.
• Track two different samples under same conditions to see difference in gene expressions.
http://www.bioalgorithms.info/presentations/Ch10_Clustering.ppt
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 31 -
Using DNA microarrays
• Green: expressed only from control.• Red: expresses only from experimental cell.• Yellow: equally expressed in both samples.• Black: NOT expressed in either
control or experimental cells.
http://www.bioalgorithms.info/presentations/Ch10_Clustering.ppt
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 32 -
DNA microarray data
321Gene 5387Gene 438.64Gene 39010Gene 2
10810Gene 1Time ZTime YTime XTime:
Intensity (expression level) of gene at measured time
• Microarray data are usually transformed into an intensity matrix (see below).
• The intensity matrix allows biologists to make correlations between diferent genes (even if they are dissimilar) and to understand how genes functions might be related.
• Clustering comes into play (more on this later).
http://www.bioalgorithms.info/presentations/Ch10_Clustering.ppt
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 33 -
Visualizing microarray data
From “Cluster analysis and display of genome-wide expression patterns” by Eisen, Spellman, Brown, and Botstein, Proc. Natl. Acad. Sci. USA, Vol. 95, pp. 14863–14868, December 1998
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 34 -
Building the “Tree of Life”
http
://en
.wik
iped
ia.o
rg/w
iki/P
hylo
gene
tic_t
ree
http
://us
ers.
rcn.
com
/jkim
ball.
ma.
ultra
net/B
iolo
gyP
ages
/T/T
axon
omy.
htm
l
Scientists build phylogenetic trees in an attempt to understand evolutionary relationships.
(These trees are “best guesses” and certainly contain errors.)
CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 35 -
Wrap-up
Remember:• Come to class having done the readings.• Check Blackboard regularly for updates.
Readings for next time:• IBA Chapter 2 on algorithms (skim if already familiar).• BB&P Chapter 2 on software (skim if already familiar).