“Computers are to Biology what Mathematics is to Physics” - Harold Morowitz

Upload: saad

Post on 11-Jan-2016

“Computers are to Biology what Mathematics is to Physics”

- Harold Morowitz

Corollaries:

1. A computer scientist who does not understand the subject matter and questions arising from Biological research will very likely not be able to make a significant contribution.

2. A Biologist who does not understand algorithmics and the construction of algorithms will be severely handicapped in doing the Biological Research of the future.

Report of the National Research Council, National Academy of Sciences, 2002

“Biological concepts, models, and theories are becoming more quantitative, and the connections between the life and physical sciences are becoming deeper and stronger”

“Computers now play a central role in the acquisition, storage, analysis, interpretation, and visualization of vast quantities of biological data”

Some Descriptions of Bioinformatics

1. (A Computer Scientist, 2005) Bioinformatics is a study of the algorithms and programs that are used by Molecular Biologists and others in the Biological and Medical Sciences in their quest for understanding protein structure and function in living organisms.

2. (Claverie & Notredame, 2003) Bioinformatics is nothing but good, sound, regular biology appropriately dressed so that it can fit into a computer.

3. (Attwood & Parry-Smith, 1999) The current research drive is to be able to understand evolutionary relationships in terms of the expression of protein function. Two computational approaches have been brought to bear on the problem: tackling it from the perspectives of sequence analysis and structure analysis, respectively.

4. (Augen, 2005) Bioinformatics lies at the intersection of Information Science and Molecular Biology. Furthermore, its development is highly dependent on simultaneous technical advances in both areas.

5. (Pevsner, 2003) Bioinformatics is designed to help the biologist use computer programs and databases to solve biological problems related to proteins, genes, and genomes with a larger goal of understanding broader issues such as the relationship of structure to function, development, and disease.

6. (Krane & Raymer, 2003) Bioinformatics strives to determine what information is biologically important and to decipher how it is used to precisely control the chemical environment within living organisms.

In Lab 2 we analysed a single sequence of cDNA. We determined the number of each type of nucleotide making up the sequence and also looked for repeated sequences. This type of analysis is useful for identifying key subsequences, determining the makeup of the sample, and searching for genetically caused diseases. It may also help explain the protein that is manufactured by this sample of DNA.

In Lab 3 we began our consideration of the problem of comparing two DNA samples. This comparison can lead to identification of the query sequence, finding homologous subsequences within the sequence, identifying strongly conserved regions, and establishing an evolutionary relationship between some of the genes contained within the given sample. Unfortunately, the problem is not a simple one. For example, where do we begin the comparison? What about the nucleotides between conserved regions?

An Analogy

Information is stored on a floppy disk as a sequence of 0’s and 1’s. We may be able to read this code and still not tell what is on the disk. Two disks may contain the same files and programs, yet the patterns of 0’s and 1’s may be entirely different.

WHY? Information is not stored as single sequential files. As files are written and erased, new files are placed in ‘holes’ and may be distributed across several of these available spaces.

Sequence Comparison

A segment of a BLAST report:

Score: 76

What does this number mean?

Where did it come from?

How good a score is this?

We will back off a little and work up to answering these questions.

First Attempt at Alignment – Dot Plots

Basic Idea –

1. Form a rectangle.

2. Place the first sequence along the right hand side of the rectangle and the second along the top.

3. Starting with the character at the top of the first sequence, place a dot in the rectangle for every character in the second sequence that matches that character from the first sequence. Repeat for each remaining character of the first sequence.

4. Examine the result to determine if there are sequences of consecutive characters that match.

5. If there are such sequences, align them.

6. Use mismatches and/or gaps to line up as many of the matching sequences as possible.
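The first three steps can be sketched in a few lines of Python. This is only an illustration of the idea, not the tool used to produce the plots in these slides; the function names `dot_plot` and `render` are made up for this example, and the first sequence is drawn down the side of the text grid.

```python
def dot_plot(first, second):
    """Return a grid with one row per character of `first` and one column
    per character of `second`; a cell is True where the characters match."""
    return [[a == b for b in second] for a in first]


def render(grid, first, second):
    """Draw the grid as text: '*' for a dot, '.' for no dot."""
    lines = ["  " + second]
    for ch, row in zip(first, grid):
        lines.append(ch + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


grid = dot_plot("ACACTG", "ACACTGATCG")
print(render(grid, "ACACTG", "ACACTGATCG"))
```

Runs of consecutive matches show up as diagonal lines of `*` characters, which is what step 4 asks us to look for.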

Well, That’s the Idea, But …

The following is the exact implementation of the previous slide. Recall that black pixels mean that there was a match. What have you learned from the plot about the sequence alignment?

The two sequences that were aligned were:

Escherichia coli O157:H7 mutS

And

Shigella boydii strain 374(S37) phosphoprotein phosphatase B (prpB) gene, partial sequence

The second is a well-conserved homolog of a subsequence of the first. If the Dot Plot method is to be of any value, it should make this fact obvious.

The plotting method is modified by using a “sliding window” of some specified length. If the window is of length n, a point is plotted only if the character and its next n – 1 characters match.

What is a good value for n? There is no consensus. Some other methods use a default value of 11, since (1/4)^11 ≈ 0.000000238, which is the probability that a match of 11 characters would occur at random. The Dot Plot Tool that is part of Mol Kit at Colorado State uses n = 9.
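A minimal sketch of the sliding-window rule, assuming plain Python strings for the sequences. The name `windowed_dot_plot` is hypothetical, and the random-match probability assumes the four bases are equally likely and independent, as in the (1/4)^11 figure above.

```python
def windowed_dot_plot(first, second, n):
    """Return the set of (i, j) positions where first[i:i+n] equals
    second[j:j+n], i.e. the character and its next n - 1 characters match."""
    dots = set()
    for i in range(len(first) - n + 1):
        window = first[i:i + n]
        for j in range(len(second) - n + 1):
            if second[j:j + n] == window:
                dots.add((i, j))
    return dots


# Chance that 11 specific nucleotides match at random, assuming the four
# bases are equally likely and independent.
p_random = (1 / 4) ** 11

# A short window on two short sequences, just to show the filtering.
dots = windowed_dot_plot("GACGGATTAG", "GATCGGAATAG", 3)
print(sorted(dots), p_random)
```

A larger n removes more of the isolated chance matches but can also hide short conserved regions, which is why the choice of n matters.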

Here is the Dot Plot with n = 11:

In this case the relationship is obvious. So the moral of the story is that it is not just the tool, it is how skilled you are at using it.

Let’s do something a little simpler that may also involve some gaps, i.e., indels. How about the sequences

ACACTG and ACACTGATCG?
GACGGATTAG and GATCGGAATAG?

For the first pair, it looks like an alignment of

ACACTG----
ACACTGATCG

is best. Gaps are used to fill out the end of the sequence.

For the second pair, one possibility is

GA-CGGATTAG
GATCGGAATAG

Note that without the gap nothing after GA would match up.

In the last two examples the gaps played significantly different roles.

In the first:

ACACTG----
ACACTGATCG

the gaps probably denote that the first sequence was terminated prematurely.

In the second:

GA-CGGATTAG
GATCGGAATAG

the gap denotes either an insertion or a deletion. These are called “indels”. Note there is also a possible mutation four characters from the end.

Dealing with Gaps

Consider the two sequences: AATCTATA and AAGATA

These sequences have different lengths – 8 versus 6

If we insist that the sequences do not contain any gaps, i.e. are ungapped, then there are only 3 ways to align them:

AATCTATA    AATCTATA    AATCTATA
AAGATA       AAGATA       AAGATA

However, suppose we decide to put gaps in the shorter sequence so that the two sequence lengths match. We now have 8 positions in which to place the two gaps (note that they may be separated). This means there are 8 choose 2, C(8,2) = 28, possible placements for the gaps and, thus, 28 different possible alignments. Note: C(n,k) = n!/(k!(n-k)!), where n! = n(n-1)(n-2)…3·2·1.

If, in addition, we were to allow 3 gaps in the top sequence, we would have C(11,3) × C(11,5) = 165 × 462 = 76,230 candidate alignments.

The inclusion of gaps makes the problem much larger. But, it is an unavoidable necessity.
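The counts on this slide are easy to check with Python's standard `math.comb`. The variable names below are just labels chosen for this example.

```python
from math import comb

long_len, short_len = len("AATCTATA"), len("AAGATA")   # 8 and 6

# Ungapped: slide the shorter sequence along the longer one.
ungapped = long_len - short_len + 1

# Two gaps placed among the 8 columns of the padded shorter sequence.
gaps_in_short = comb(8, 2)

# Three gaps added to the longer sequence as well: 11 columns, with
# 3 gaps on top and 5 on the bottom, counted as on the slide.
gaps_in_both = comb(11, 3) * comb(11, 5)

print(ungapped, gaps_in_short, gaps_in_both)
```

Even for sequences this short, allowing a handful of gaps moves us from 3 candidates to tens of thousands, so checking every alignment by hand is not an option.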

Three of the 28 possible alignments with gaps allowed in the shorter sequence:

AATCTATA    AATCTATA    AATCTATA
AAG-AT-A    AAG--ATA    AAGATA--

When scoring these alignments we need to assign a “penalty” for these gaps or ‘indels’.

One possibility is to give a -1 for each gap character in the alignment.

Another is to consider starting a gap to be more serious than allowing a gap to continue, i.e. G-AT-A is considered a more serious difference than G--ATA.

For example, we might assess a penalty of -5 for starting a gap and only -1 for continuing it. In the above example the first instance would have a penalty of -10 and the second, -6.
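The two penalty schemes can be compared with a small helper. This is a sketch under the slide's example values (-5 to start a gap, -1 to continue it); the function name `gap_penalty` and its parameter names are assumptions made for illustration.

```python
def gap_penalty(aligned, gap_open=-5, gap_extend=-1):
    """Total gap penalty for one aligned sequence: `gap_open` for the first
    '-' of each run of gaps, `gap_extend` for every further '-' in that run."""
    total = 0
    previous = ""
    for ch in aligned:
        if ch == "-":
            total += gap_extend if previous == "-" else gap_open
        previous = ch
    return total


print(gap_penalty("AAG-AT-A"))   # two separate gaps
print(gap_penalty("AAG--ATA"))   # one gap of length two
```

Setting `gap_open` and `gap_extend` to the same value gives back the simpler scheme of a fixed penalty per gap character.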

Overall Scoring of Sequence Alignments

Strategy 1

The Identity Matching Scheme

This is essentially a “no penalty for mismatches” scheme

    A   T   C   G
A   1   0   0   0
T   0   1   0   0
C   0   0   1   0
G   0   0   0   1

Penalty for indel = -1

Scores for our previous alignment candidates:

AATCTATA    AATCTATA    AATCTATA
AAG-AT-A    AAG--ATA    AAGATA--

1. 1+1+0-1+0+0-1+1 = 1
2. 1+1+0-1-1+1+1+1 = 3
3. 1+1+0+0+1+1-1-1 = 2

Or, if we do not assign a penalty for ending gaps, the third alignment scores 1+1+0+0+1+1 = 4.
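The identity scheme is simple enough to write out directly. The sketch below assumes the two rows of an alignment are given as equal-length strings with `-` for a gap; `identity_score` and the `ignore_end_gaps` switch are names invented for this example.

```python
def identity_score(top, bottom, gap=-1, ignore_end_gaps=False):
    """Score two aligned strings of equal length: +1 for a match,
    0 for a mismatch, `gap` for a column containing '-'."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    columns = list(zip(top, bottom))
    if ignore_end_gaps:
        # Drop trailing columns that contain a gap.
        while columns and "-" in columns[-1]:
            columns.pop()
    score = 0
    for a, b in columns:
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += 1
    return score


for bottom in ("AAG-AT-A", "AAG--ATA", "AAGATA--"):
    print(bottom, identity_score("AATCTATA", bottom))
```

Calling `identity_score("AATCTATA", "AAGATA--", ignore_end_gaps=True)` gives the "no penalty for ending gaps" variant.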

Discussion of Strategy 1

• The second alignment has the highest score

• Problem is: what does this score represent?

• There is no recognition of the quality of the mismatch

• The history of the two sequences is not taken into account. We have just assigned numbers.

• May not be at all appropriate for a situation where we have a quickly mutating organism, such as a virus. In such an organism substitutions are common, and a given site may undergo several changes.

Strategy Due to Jukes/Cantor

Make the assumption that any nucleotide is equally likely to change into any of the other nucleotides.

    A   T   C   G
A   5  -4  -4  -4
T  -4   5  -4  -4
C  -4  -4   5  -4
G  -4  -4  -4   5

Jukes and Cantor made no provision for indels. Others have added a gap penalty, but there is no consensus as to what that should be. We will arbitrarily choose -8.

Scoring the three alignments:

AATCTATA    AATCTATA    AATCTATA
AAG-AT-A    AAG--ATA    AAGATA--

1. 5+5-4-8-4-4-8+5 = -13
2. 5+5-4-8-8+5+5+5 = 5
3. 5+5-4-4+5+5-8-8 = -4

OR, if we ignore gaps at the end of the sequence, the third alignment scores 5+5-4-4+5+5 = 12.
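Once the scoring matrix is stored as a lookup table, the same loop scores any alignment. A minimal sketch, assuming the matrix values and the -8 gap penalty chosen on this slide; `MATRIX`, `GAP` and `matrix_score` are illustrative names. Each alignment has eight columns, so each total is a sum of eight terms.

```python
BASES = "ATCG"

# Uniform substitution matrix: 5 on the diagonal, -4 everywhere else.
MATRIX = {(a, b): (5 if a == b else -4) for a in BASES for b in BASES}
GAP = -8  # the arbitrary gap penalty chosen on the slide


def matrix_score(top, bottom, matrix=MATRIX, gap=GAP):
    """Sum the matrix entry for each aligned pair, or `gap` for a gap column."""
    return sum(
        gap if "-" in (a, b) else matrix[(a, b)]
        for a, b in zip(top, bottom)
    )


scores = [matrix_score("AATCTATA", b)
          for b in ("AAG-AT-A", "AAG--ATA", "AAGATA--")]
print(scores)
```

Passing a different dictionary as `matrix` is all that is needed to try another scoring scheme on the same alignments.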

Discussion of Jukes/Cantor

• The scoring of the alignments yielded exactly the same ranking as was the case for the identity matrix score

• The “quality” of the scores is different – It is based on observations of frequency data of “known” sequences.

• Assumption that all mismatches are equally likely is not supported by our previous lectures or by examining larger, more recent, databases of known sequences where ancestral data is known.

• No standard way to choose a gap penalty.

The Kimura Two Parameter Model

Based on assigning different probabilities to different types of mismatches: purine–purine and pyrimidine–pyrimidine substitutions (transitions) as opposed to purine–pyrimidine substitutions (transversions).

    A   T   C   G
A   1  -5  -5  -1
T  -5   1  -1  -5
C  -5  -1   1  -5
G  -1  -5  -5   1

Once again, the issue of gaps is decided arbitrarily. We will use -8, as in the previous case.

Scoring the three alignments once more:

AATCTATA    AATCTATA    AATCTATA
AAG-AT-A    AAG--ATA    AAGATA--

1. 1+1-5-8-5-5-8+1 = -28
2. 1+1-5-8-8+1+1+1 = -16
3. 1+1-5-5+1+1-8-8 = -22

OR, ignoring gaps at the end of the sequence, the third alignment scores 1+1-5-5+1+1 = -6.
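Rather than typing in the whole matrix, the two-parameter idea can be expressed as a rule: look at whether the two bases belong to the same chemical class. The sketch below uses the values from the table on this slide (1, -1, -5) and the -8 gap penalty; the function names are hypothetical.

```python
PURINES = {"A", "G"}  # C and T are the pyrimidines


def substitution_score(a, b, match=1, transition=-1, transversion=-5):
    """Score a pair of nucleotides by the kind of change it represents."""
    if a == b:
        return match
    same_class = (a in PURINES) == (b in PURINES)
    return transition if same_class else transversion


def two_parameter_score(top, bottom, gap=-8):
    """Score an alignment column by column, charging `gap` for each '-'."""
    return sum(
        gap if "-" in (a, b) else substitution_score(a, b)
        for a, b in zip(top, bottom)
    )


for bottom in ("AAG-AT-A", "AAG--ATA", "AAGATA--"):
    print(bottom, two_parameter_score("AATCTATA", bottom))
```

Writing the rule this way makes the two parameters explicit: one value for within-class changes and one for between-class changes.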

Discussion of the Kimura Matrix

• More realistically set up – uses more recent (1980) database information to calculate frequency probabilities and assign scoring values.

• Seems to have a propensity towards negative scores

• Still no resolution of the “proper” weighting for gaps

• Note that the rankings of the three alignments still maintained the same order.

• Again, the issue is the quality of the rankings.

No matter what scoring scheme is used the basic problem is simply this: Given two strands of DNA, how can we find the best possible alignment for these two sequences?

We need to follow certain rules:

• Must use a “sensible” scoring matrix and gap penalty

• The order of the characters in the sequence cannot be changed, but gaps can be inserted between them.

• Gaps may appear in either of the sequences.

• Each pairing, whether of two characters (one from each sequence) or of a gap with a character from the other sequence, must be chosen so that the score up to that point is the best possible score.

How do we implement the last rule?

• In Mathematics this is called “The Principle of Optimality” or POO (The method of problem solving using POO is called Dynamic Programming).

• How can the score for a particular position within the alignment be calculated?

1. We can pair the two available characters. This adds the score for a match or mismatch to the previous score:

   Score_i = Score_{i-1} + s(a, b)

2. We can pair the character in the first string with a gap:

   Score_i = Score_{i-1} + s(a, -)

3. We can pair the character in the second string with a gap:

   Score_i = Score_{i-1} + s(-, b)

Either of the last two adds the gap penalty to the previous score.
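The three update rules can be collected into a single recurrence over a table. The notation here is an assumption made for this note and does not appear on the slides: F(i, j) is the best score for aligning the first i characters of the first sequence x with the first j characters of the second sequence y, and s is the scoring function used above. This form is commonly associated with the Needleman–Wunsch global alignment method.

```latex
F(0,0) = 0, \qquad
F(i,0) = F(i-1,0) + s(x_i, -), \qquad
F(0,j) = F(0,j-1) + s(-, y_j)

F(i,j) = \max
\begin{cases}
F(i-1,\,j-1) + s(x_i, y_j) & \text{pair the two characters} \\
F(i-1,\,j) + s(x_i, -)     & \text{pair } x_i \text{ with a gap} \\
F(i,\,j-1) + s(-, y_j)     & \text{pair } y_j \text{ with a gap}
\end{cases}
```

Taking the maximum at every cell is exactly the Principle of Optimality: the best score to a position is built from the best scores to the positions just before it.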

The Three Ways To Score A Position

First make a table with the first sequence down the left side and the second along the top. Calculate the three possible values for each cell. Choose the largest of these values.

    G   T   A
C   *   *   *
T   *   *

[Figure: three arrows into the empty cell, labeled “Gap in second sequence”, “Gap in first sequence”, and “Pair the two symbols”]

Let’s try one

        A   C   T   C   G
    0  -1  -2  -3  -4  -5
A  -1
C  -2
A  -3
G  -4
T  -5
A  -6
G  -7

Scoring Scheme: Match = 1, Mismatch = 0, Gap = -1

The negative numbers along the top and down the left are the gap penalties for leading gaps in either of the sequences.

Calculating a Cell Value

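        A   C   T   C   G
    0  -1  -2  -3  -4  -5
A  -1   1
C  -2
A  -3
G  -4
T  -5
A  -6
G  -7

Scoring Scheme: Match = 1, Mismatch = 0, Gap = -1

The three candidate values for the first cell are 0 + 1 = 1 (pairing A with A), -1 - 1 = -2 (coming from the cell above) and -1 - 1 = -2 (coming from the cell to the left). The largest of these, 1, is entered in the cell.

The negative numbers along the top and down the left are the gap penalties for leading gaps in either of the sequences.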

Filling in the row and column

        A   C   T   C   G
    0  -1  -2  -3  -4  -5
A  -1   1   0  -1  -2  -3
C  -2   0
A  -3  -1
G  -4  -2
T  -5  -3
A  -6  -4
G  -7  -5

Scoring Scheme: Match = 1, Mismatch = 0, Gap = -1

Indicate the arrow that gave the highest score for each cell. Dynamic Programming records the best score and where it came from. This will be important once the table is filled.

The Complete Matrix

        A   C   T   C   G
    0  -1  -2  -3  -4  -5
A  -1   1   0  -1  -2  -3
C  -2   0   2   1   0  -1
A  -3  -1   1   2   1   0
G  -4  -2   0   1   2   2
T  -5  -3  -1   1   1   2
A  -6  -4  -2   0   1   1
G  -7  -5  -3  -1   0   2

Scoring Scheme: Match = 1, Mismatch = 0, Gap = -1
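The table above can be reproduced with a short program. This is a sketch of the fill step only, assuming the scoring scheme on this slide (match = 1, mismatch = 0, gap = -1); the name `fill_matrix` and its arguments are chosen for this example.

```python
def fill_matrix(side, top, match=1, mismatch=0, gap=-1):
    """Return the table of best scores; row i, column j holds the best
    score for aligning side[:i] with top[:j]."""
    rows, cols = len(side) + 1, len(top) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = table[i - 1][0] + gap      # leading gaps in `top`
    for j in range(1, cols):
        table[0][j] = table[0][j - 1] + gap      # leading gaps in `side`
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if side[i - 1] == top[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,      # pair the two characters
                table[i - 1][j] + gap,           # gap in `top`
                table[i][j - 1] + gap,           # gap in `side`
            )
    return table


table = fill_matrix("ACAGTAG", "ACTCG")
for label, row in zip(" ACAGTAG", table):
    print(label, " ".join(f"{v:3d}" for v in row))
```

Each cell is computed once from three neighbours that are already filled, so the work grows with the product of the two sequence lengths rather than with the number of possible alignments.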

Dynamic Programming Finds the Best Score and the Corresponding Alignment

        A   C   T   C   G
    0  -1  -2  -3  -4  -5
A  -1   1   0  -1  -2  -3
C  -2   0   2   1   0  -1
A  -3  -1   1   2   1   0
G  -4  -2   0   1   2   2
T  -5  -3  -1   1   1   2
A  -6  -4  -2   0   1   1
G  -7  -5  -3  -1   0   2

Alignment: Start in the lower right corner and work backwards:

AC--TCG
ACAGTAG

Rules to Discover the Alignment

1. Start in the lower right box – this box contains the best alignment score for the two sequences relative to this particular scoring scheme. NOTE: This may NOT be the largest value in the table, but it is the best score for completely aligning the two sequences. All other scores in the table are for partial alignments of the sequences.

2. Work backwards following the arrows from the present box in reverse order.

3. Diagonal arrow is a pairing of the characters

4. Vertical arrow represents a gap in the sequence across the top

5. Horizontal arrow represents a gap in the sequence along the side.
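The five rules can be sketched as a traceback over the filled table. Two choices below are assumptions of this sketch rather than something the slides specify: the arrows are re-derived from the scores instead of being stored while filling, and ties are broken in favour of the diagonal arrow, then the vertical one. The function name `global_align` is also made up for this example.

```python
def global_align(side, top, match=1, mismatch=0, gap=-1):
    """Fill the score table, then walk back from the lower right corner.
    Returns (score, aligned_top, aligned_side)."""
    rows, cols = len(side) + 1, len(top) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if side[i - 1] == top[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + pair,
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)

    out_top, out_side = [], []
    i, j = len(side), len(top)
    while i > 0 or j > 0:
        both = i > 0 and j > 0
        pair = match if both and side[i - 1] == top[j - 1] else mismatch
        if both and table[i][j] == table[i - 1][j - 1] + pair:
            # Diagonal arrow: pair the two characters.
            out_top.append(top[j - 1])
            out_side.append(side[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + gap:
            # Vertical arrow: gap in the sequence across the top.
            out_top.append("-")
            out_side.append(side[i - 1])
            i -= 1
        else:
            # Horizontal arrow: gap in the sequence along the side.
            out_top.append(top[j - 1])
            out_side.append("-")
            j -= 1
    return (table[-1][-1],
            "".join(reversed(out_top)),
            "".join(reversed(out_side)))


score, aligned_top, aligned_side = global_align("ACAGTAG", "ACTCG")
print(score)
print(aligned_top)
print(aligned_side)
```

When several arrows tie for the best score there is more than one optimal alignment, and a different tie-breaking order may return a different one with the same score.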