introduction to molecular biology bioinformatics: issues and...

35
CSE 308-408 · Bioinformatics: Issues and Algorithms Lopresti · Fall 2007 · Lecture 2 - 1 - Bioinformatics: Issues and Algorithms CSE 308-408 Fall 2007 Lecture 2 Introduction to Molecular Biology

Upload: others

Post on 21-May-2020

13 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 1 -

Bioinformatics:Issues and Algorithms

CSE 308-408 • Fall 2007 • Lecture 2

Introduction toMolecular Biology

Page 2: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 2 -

Important points to remember

CSE 308-408 is NOT a programming course:• Previous programming experience not required to do well.• Being a great programmer does not imply a high grade.• Best grades go to students who work to learn bioinformatics.

We will study:• Problems from bioinformatics.• Algorithms used to solve them.• Perl programming (language of choice for bioinformatics).

Requirements:• Homework and programming assignments.• Final project or paper due at end of semester.• CSE 408 students must also “scribe” one lecture.

Page 3: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 3 -

Course grading

Scribe duty (CSE 408)* = n/a 10% grade

* Note that CSE 308 and CSE 408 point totals will be different and each will be curved separately.

Homework assignments = 25% grade 20% grade

Programming assignments = 25% grade 20% grade

Final project or paper = 50% grade 50% grade

CSE 308 CSE 408

Page 4: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 4 -

The Central Dogma of Molecular Biology

1. DNA copies its information in process involving many enzymes (replication).2. DNA codes for production of mRNA during transcription.3. mRNA migrates from nucleus to cytoplasm.4. mRNA carries coded information to ribosomes which "read" it and use it for protein synthesis (translation).

http://allserv.rug.ac.be/~avierstr/principles/centraldogma.html

Page 5: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 5 -

The Central Paradigm of Bioinformatics

By developing techniques for analyzing sequence data and the structures that result, we can attempt to understand the genetic nature of diseases.

http://cmgm.stanford.edu/biochem218/

Page 6: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 6 -

“Junk” DNA

http://www.accessexcellence.org/AB/GG/genes.htmlhttp://www.psrast.org/junkdna.htm

At this point in time, <10% of theDNA in the human genome can be associated with genes.The remainder is known as junk DNA because it has no apparent function.

Recall that genes are contiguous stretches along a chromosome.

However, recent studies are showing that non-coding DNA may play an important role in regulating gene expression (enhancing or suppressing expression of proximal genes).

It's also used in forensic analysis as mutations are more likely in non-coding DNA regions than within genes (why?).

Page 7: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 7 -

Genetic inheritence

http://www.accessexcellence.org/AB/GG/hapDIP.html

4. Zygote splits and reproduces through mitosis to yield multi-cellular diploid organism.

1. Cells in mother and father both contain paired sets of chromosomes (diploid).

2. Through meiosis, gametes (sex cells) contain only one chromosome from each pair (haploid).

3. Fertilized egg cell (zygote) receives one chromosome from mother, one from father.

Page 8: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 8 -

Crossing over (recombination)

During meiosis, homologous chromosomes may cross over (recombine) forming chromosomes that mix genes from each parent.

http://www.accessexcellence.org/AB/GG/comeiosis.html

The two chromosomes that form a pair are called homologous.

Note that liklihood of recombination is function of distance between two genes. This observation is used in creating genetic linkage maps.

Here we see recombination of gene c/C which appears in two forms (alleles). Genes ab (AB) are unlikely to recombine.

Page 9: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 9 -

Genomes

Complete set of chromosomes that determines an organism is known as a genome.

Note that each cell in an organism contains its entire genome!

Sizes ofsome genomes

http://www.cbs.dtu.dk/databases/DOGS/ http://www.nsrl.ttu.edu/tmot1/mus_musc.htm http://www.oardc.ohio-state.edu/seedid/single.asp?strID=324

Mus musculus

Page 10: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 10 -

We're more similar than you might think

http://www.ornl.gov/sci/techresources/Human_Genome/graphics/slides/ttmousehuman.shtmlhttp://www.news.cornell.edu/releases/Dec03/chimp.life.hrs.html

(The DNA of chimpanzees and humans is ~99% similar.)

Page 11: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 11 -

Genetic linkage map(107 – 108 base pairs)

Studying a genome

Most genomes are enormous (e.g., 1010 base pairs in case of human). Current sequencing technology, on the other hand, only allows biologists to determine ~103 base pairs at a time.

This disparatey leads to some of the most interesting problems in computational biology.

Physical map(105 – 106 base pairs)

Sequencing(103 – 104 base pairs)

ACTAGCTGATCGATTTAGCAGCAG...

Page 12: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 12 -

Studying a genome

http://www.ornl.gov/sci/techresources/Human_Genome/publicat/primer/primer.pdf

These results are compiledto provide sequence acrossa chromosome.Yeast artificial chromosome (YAC) is designed to “fool” yeast replication mechanism.

Cosmids and plasmids are vectors that can be cloned in bacteria.

Cloned DNA molecules are made progressively smaller and fragments subcloned to obtain pieces small enoughto sequence directly.

Page 13: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 13 -

Cutting DNA using restriction enzymes

The separated pieces have single-stranded sticky ends, which allow complementary pieces to combine.

http://www.accessexcellence.org/AB/GG/restriction.html

Note that GAATTC → CTTAAG → GAATTC (i.e., palindrome).

A restriction enzyme surrounds DNA molecule at specific point, called restriction site (sequence GAATTC in this case).

It cuts one strand of DNA helix at one point and second strand at a different, complementary point (between G and A).

Page 14: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 14 -

Breaking DNA

http://occawlonline.pearsoned.com/bookbind/pubbooks/bc_mcampbell_genomics_1/medialib/method/shotgun.html

DNA can also be broken in random places through mechanical means (e.g., vibration). This is typically the first step in shotgun sequencing.

Page 15: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 15 -

Copying DNA

Most analytic procedures in the lab require a quantity of the DNA under study. The process of copying DNA is known as amplification.

As we have seen, one possible approach is to use nature: insert the DNA of interest into the genome of a host (or vector) and let the organism multiply itself. This is called recombinant DNA.

http://www.accessexcellence.org/AB/GG/plasmid.html

Page 16: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 16 -

Polymerase Chain Reaction

http://www.accessexcellence.org/AB/GG/polymerase.htmlhttp://www.iupui.edu/~wellsctr/MMIA/htm/animations.htm

PCR rapidly amplifies a single DNA molecule into billions of molecules.

Another way to amplify DNA is polymerase chain reaction (PCR).

PCR alternates two phases: separate DNA into single strands using heat; convert into double strands using primer and polymerase reaction.

Page 17: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 17 -

Polymerase Chain Reaction

http://www.dnalc.org/ddnalc/resources/pcr.html

Page 18: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 18 -

Reading DNA

http://www.apelex.fr/anglais/applications/sommaire2/sanger.htmhttp://www.iupui.edu/~wellsctr/MMIA/htm/animations.htm

In general, DNA molecules with similar lengths will migrate same distance.

This is known as sequencing.

First "cut" DNA at each base: A, C, G, T. Then run gel and read off sequence: TCGCGA ...

Gel electrophoresis is process of separating a mixture of molecules in a gel media by application of an electric field.

Page 19: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 19 -

Gel electrophoresis

http://www.dnalc.org/ddnalc/resources/electrophoresis.html

Page 20: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 20 -

Reading DNA

ATCGTGTCGATAATCGTGTCGAA

A

Original sequence: ATCGTGTCGATAGCGCT

G ATCGTGTCGATAGATCGTGTCGATCGTGATCG

ATCGTGTCGATAGCG

T ATCGTGTCGATATCGTGTCGATAGCGCTATCGTGTATCGTAT

ATCGTGTCGATAGCGCATCGTGTCGATAGCATCGTGTCATC

C

Page 21: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 21 -

Sanger sequencing

http://www.dnalc.org/ddnalc/resources/sangerseq.html

Page 22: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 22 -

DNA sequencing

http://www.bii.a-star.edu.sg/docs/sbg/notes/n1/hu-genome%20lecture2.pdf

Page 23: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 23 -

Sequence assembly

fragments

fragmentassembly

original

targetcontigcontig

gap

Page 24: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 24 -

Sequence assembly

Simple model of DNA assembly is Shortest Supersequence Problem: given a set of sequences, find shortest sequence S such that each of original sequences appears as subsequence of S.

TTACCGTGC

ACCGT

CGTGCTTAC

--ACCGT------CGTGCTTAC-----3

12

Look for overlap between prefix ofone sequence and suffix of another:

Page 25: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 25 -

Inferring gene functionality

• DNA microarrays allow biologists to infer gene function whenthere is insufficent evidencebased on similarity alone.

http://www.bioalgorithms.info/presentations/Ch10_Clustering.ppt

• For 40% of sequenced genes, functionality cannot be ascertained by such techniques.

• Researchers want to know functions of new genes.

• Simply comparing new gene sequences to known DNAoften does not reveal actual function of gene.

Page 26: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 26 -

DNA microarray analysis

• More mRNA usually indicates more gene activity.

http://www.bioalgorithms.info/presentations/Ch10_Clustering.ppt

• DNA microarrays measure the activity (expression level) of the gene under varying conditions/time points.

• Expression level is estimated by measuring the amount of mRNA for that particular gene.

• A gene is active if it is being transcribed.

• Measurements are relative, not absolute!

Page 27: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 27 -

DNA microarray experiments

• Scan microarray multiple times for different color phosphors.

http://www.bioalgorithms.info/presentations/Ch10_Clustering.ppt

• Analyze mRNA produced from cells in tissue with environmental conditions you are testing.

• Produce cDNA from mRNA (DNA is more stable).• Attach phosphor to cDNA to see when a particular gene is

expressed.

• Different color phosphors are available to compare many samples at once.

• Hybridize cDNA over the microarray.• Scan the microarray with a phosphor-illuminating laser.

• Illumination reveals transcribed genes.

Page 28: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 28 -

DNA microarray experiments

Phosphors can be added here instead ...... then instead of

staining, laser illumination is used

http://www.bioalgorithms.info/presentations/Ch10_Clustering.ppt

Page 29: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 29 -

DNA microarrays

http://www.bio.davidson.edu/Courses/genomics/chip/chip.html

Page 30: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 30 -

Using DNA microarrays

Each box represents one gene’s expression over time

• Track sample over a period of time to see gene expression over time.

• Track two different samples under same conditions to see difference in gene expressions.

http://www.bioalgorithms.info/presentations/Ch10_Clustering.ppt

Page 31: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 31 -

Using DNA microarrays

• Green: expressed only from control.• Red: expresses only from experimental cell.• Yellow: equally expressed in both samples.• Black: NOT expressed in either

control or experimental cells.

http://www.bioalgorithms.info/presentations/Ch10_Clustering.ppt

Page 32: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 32 -

DNA microarray data

321Gene 5387Gene 438.64Gene 39010Gene 2

10810Gene 1Time ZTime YTime XTime:

Intensity (expression level) of gene at measured time

• Microarray data are usually transformed into an intensity matrix (see below).

• The intensity matrix allows biologists to make correlations between diferent genes (even if they are dissimilar) and to understand how genes functions might be related.

• Clustering comes into play (more on this later).

http://www.bioalgorithms.info/presentations/Ch10_Clustering.ppt

Page 33: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 33 -

Visualizing microarray data

From “Cluster analysis and display of genome-wide expression patterns” by Eisen, Spellman, Brown, and Botstein, Proc. Natl. Acad. Sci. USA, Vol. 95, pp. 14863–14868, December 1998

Page 34: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 34 -

Building the “Tree of Life”

http

://en

.wik

iped

ia.o

rg/w

iki/P

hylo

gene

tic_t

ree

http

://us

ers.

rcn.

com

/jkim

ball.

ma.

ultra

net/B

iolo

gyP

ages

/T/T

axon

omy.

htm

l

Scientists build phylogenetic trees in an attempt to understand evolutionary relationships.

(These trees are “best guesses” and certainly contain errors.)

Page 35: Introduction to Molecular Biology Bioinformatics: Issues and Algorithmslopresti/Courses/2007-08/CSE308-408/Lecture… · Introduction to Molecular Biology. CSE 308-408 · Bioinformatics:

CSE 308-408 · Bioinformatics: Issues and AlgorithmsLopresti · Fall 2007 · Lecture 2 - 35 -

Wrap-up

Remember:• Come to class having done the readings.• Check Blackboard regularly for updates.

Readings for next time:• IBA Chapter 2 on algorithms (skim if already familiar).• BB&P Chapter 2 on software (skim if already familiar).