1 Bioinformatics Algorithms Lecture 2 © Jeff Parker, 2009 A bacteriologist is a man whose conversation always starts with the germ of an idea.

Upload: frederick-berry

Post on 18-Jan-2018


Bioinformatics Algorithms
Lecture 2

© Jeff Parker, 2009

A bacteriologist is a man whose conversation always starts with the germ of an idea.

Outline
Review Website changes
    Any problems? BBoard?
Review your questions
Review the homework questions
Review Dynamic Programming and string match
New material
    Reading file formats in Python
    Open Reading Frames
    Turnpike Problem

Questions
What are your goals?
Will this course be offered next year?
What will the exams be like?
Can you go over the probability example again?
I didn't understand the Dynamic Programming example
Will this get me a job?

(short) Answers
What are your goals?
    See next slide
Will this course be offered next year?
    Yes
What will the exams be like?
    Solve this problem: suggest an algorithm
Can you go over the probability example again?
    Yes
I didn't understand the Dynamic Programming example
    I didn't explain it well, and combined two forms
    Let's try again tonight
Will this get me a job?
    Not right away

Course Goals
Introduce an interesting problem from Biology
Apply some Computer Science techniques
Our guide will be Jones and Pevzner
    I hope to cover all the chapters
    Introduce some topics that are not covered
    Add some problems in probability, which this book does not attempt to cover
My goal is to make the course accessible to students in the Biotechnology Program
    Less focus on programming, more on algorithms
    Programming projects will be smaller, more exploratory
Students will pick and hand in a final project on a topic of their choice
    Apply ideas discussed in class to a problem of their choice

Fibonacci numbers
1, 1, 2, 3, 5, 8, 13, ...

    def fib(n):
        if (1 == n):
            return 1
        elif (2 == n):
            return 1
        else:
            return fib(n-1) + fib(n-2)

Call tree for fib(5):

    fib(5)
        fib(4)
            fib(3)
                fib(2)
                fib(1)
            fib(2)
        fib(3)
            fib(2)
            fib(1)

Ruminations

    def fib(n):
        if (1 == n):
            return 1
        elif (2 == n):
            return 1
        else:
            return fib(n-1) + fib(n-2)

Not well defined for negative numbers
How many calls to compute fib(10)?

    fib(5)
        fib(4)
            fib(3)
                fib(2)
                fib(1)
            fib(2)
        fib(3)
            fib(2)
            fib(1)

Running time

    def fib(n):
        if (1 == n):
            return 1
        elif (2 == n):
            return 1
        else:
            return fib(n-1) + fib(n-2)

The function only returns the value 1 or a sum, so it must make one call for each symbol in the expanded sum
    fib(5) = 1 + 1 + 1 + 1 + 1: five calls return 1 and four calls add
To compute fib(n) we need 2*fib(n) - 1 calls

    fib(5)
        fib(4)
            fib(3)
                fib(2)
                fib(1)
            fib(2)
        fib(3)
            fib(2)
            fib(1)
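
The call count can be checked directly. The sketch below is not from the slides: count_calls is a hypothetical helper that mirrors the recursion of fib and counts one call per invocation, including the outermost one.

```python
def fib(n):
    """Naive recursive Fibonacci, as on the slide (defined for n >= 1)."""
    if n == 1 or n == 2:
        return 1
    return fib(n - 1) + fib(n - 2)


def count_calls(n):
    """Number of calls the naive fib(n) makes, counting the outermost call."""
    if n == 1 or n == 2:
        return 1
    return 1 + count_calls(n - 1) + count_calls(n - 2)


# The claim 2*fib(n) - 1 holds for every n we try.
for n in range(1, 11):
    assert count_calls(n) == 2 * fib(n) - 1

print(count_calls(10))  # 109, since fib(10) = 55
```

The number of calls grows as fast as the Fibonacci numbers themselves, which is what makes the naive version slow.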

Dynamic Programming

    def fib3(n):
        d = {}                      # table of values computed so far
        for i in range(1, n + 1):
            if (i < 3):
                d[i] = 1
            else:
                d[i] = d[i-1] + d[i-2]
        return d[n]

Running time?
Trade off?
1 1 2 3 5 8 13

Top Down DP
For some problems, we can assemble the information on the fly
We start with an empty dictionary, d

    def fib2(n, d):
        if (n in d):
            return d[n]
        elif ((1 == n) or (2 == n)):
            result = 1
        else:
            result = fib2(n-1, d) + fib2(n-2, d)

        d[n] = result
        return result

1 1 2 3 5 8 13
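
The same top-down idea can be written with the standard library doing the bookkeeping. This is a sketch rather than the lecture's code: fib_memo is a hypothetical name, and functools.lru_cache takes the place of the explicit dictionary d.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib_memo(n):
    """Top-down Fibonacci; lru_cache plays the role of the dictionary d."""
    if n == 1 or n == 2:
        return 1
    return fib_memo(n - 1) + fib_memo(n - 2)


print([fib_memo(n) for n in range(1, 8)])  # [1, 1, 2, 3, 5, 8, 13]
```

Each value is computed once and then looked up, so the number of calls grows linearly in n rather than like fib(n).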

Top Down DP

    def fib2(n, d):
        # Show each call with the dictionary so far, keys in sorted order
        print("fib2", n, dict(sorted(d.items())))
        if (n in d):
            return d[n]
        elif ((1 == n) or (2 == n)):
            result = 1
        else:
            result = fib2(n-1, d) + fib2(n-2, d)

        d[n] = result
        return result

Output of fib2(10, {}):

    fib2 10 {}
    fib2 9 {}
    fib2 8 {}
    fib2 7 {}
    fib2 6 {}
    fib2 5 {}
    fib2 4 {}
    fib2 3 {}
    fib2 2 {}
    fib2 1 {2: 1}
    fib2 2 {1: 1, 2: 1, 3: 2}
    fib2 3 {1: 1, 2: 1, 3: 2, 4: 3}
    fib2 4 {1: 1, 2: 1, 3: 2, 4: 3, 5: 5}
    fib2 5 {1: 1, 2: 1, 3: 2, 4: 3, 5: 5, 6: 8}
    fib2 6 {1: 1, 2: 1, 3: 2, 4: 3, 5: 5, 6: 8, 7: 13}
    fib2 7 {1: 1, 2: 1, 3: 2, 4: 3, 5: 5, 6: 8, 7: 13, 8: 21}
    fib2 8 {1: 1, 2: 1, 3: 2, 4: 3, 5: 5, 6: 8, 7: 13, 8: 21, 9: 34}

Approximate Pattern Match
How can we define how far apart two sequences are?
We start with the Levenshtein distance or edit distance

    ATCGG-
    AT-GGA

    Smallest number of insertions, deletions, and substitutions required
    Charge 1 for each change
    Satisfies properties for a metric, including the triangle inequality
        D(a,c) <= D(a, b) + D(b, c)
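
A minimal sketch of the edit distance, assuming unit cost for every insertion, deletion, and substitution as stated above. The function name edit_distance and the test sequences are illustrative choices, not from the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match (0) or substitute (1)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ATCGGA", "ATGGA"))  # 1: delete the C

# Spot-check the triangle inequality on three short sequences.
a, b, c = "ATCGGA", "ATGGA", "TTGGA"
assert edit_distance(a, c) <= edit_distance(a, b) + edit_distance(b, c)
```

Only two rows of the table are kept at a time, which is enough when we want the distance but not the alignment itself.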

Recursive Solution
To find a match between ATGGA and ATCGGA
Try all possible actions on first characters, then compare the rest
    Match or replace first characters of each string
    Drop first char of text
    Drop first char of pattern
Try to match the remainder using recursion
At each step, at least one string is shorter.

    ATGGA / ATCGGA
        TGGA / TCGGA      (match or replace first characters)
        TGGA / ATCGGA     (drop first char of ATGGA)
        ATGGA / TCGGA     (drop first char of ATCGGA)
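
The three choices translate almost line for line into a recursive function. This is a sketch under the unit-cost assumption of the edit distance; match_cost is a hypothetical name, and no results are cached, so it repeats work much as the naive fib does.

```python
def match_cost(pattern, text):
    """Fewest edits to turn pattern into text, by plain recursion."""
    if not pattern:
        return len(text)      # insert whatever text remains
    if not text:
        return len(pattern)   # delete whatever pattern remains
    # Match or replace the first characters of each string.
    both = match_cost(pattern[1:], text[1:]) + (pattern[0] != text[0])
    # Drop the first character of the text.
    drop_text = match_cost(pattern, text[1:]) + 1
    # Drop the first character of the pattern.
    drop_pattern = match_cost(pattern[1:], text) + 1
    return min(both, drop_text, drop_pattern)


print(match_cost("ATGGA", "ATCGGA"))  # 1
```

Every call shortens at least one string, so the recursion terminates, but the number of calls grows exponentially with the string lengths.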

Dots
List the pattern and text as row and column headings.
Place a dot in each cell where row heading and column heading match.
We will use this idea in other ways...
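
A small sketch of such a dot matrix, assuming the pattern runs across the top and the text runs down the side. The name dot_plot and the choice of '*' for a match and '.' for an empty cell are illustrative.

```python
def dot_plot(pattern, text):
    """Return rows of a dot matrix: '*' where the row and column letters agree."""
    rows = ["  " + " ".join(pattern)]
    for t in text:
        cells = ["*" if t == p else "." for p in pattern]
        rows.append(t + " " + " ".join(cells))
    return rows


for row in dot_plot("ATGGA", "ATCGGA"):
    print(row)
```

Runs of matches show up as diagonals, and the break in the diagonal marks the row for C, which has no partner in the pattern.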

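Needleman & Wunsch

              -    A    T    G    G    A
         -    0   -1   -2   -3   -4   -5
         A   -1
         T   -2
         C   -3
         G   -4
         G   -5
         A   -6

Each cell is labeled with the pair of prefixes it aligns: the top row pairs -, A, AT, ATG, ATGG, ATGGA with an empty string, and the first column pairs an empty string with A, AT, ATC, ATCG, ATCGG, ATCGGA.

Global Match
Match all of two strings
Notice the labeling
Rules:
    Match +1
    Mismatch -1
    Insert/Delete -1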

Needleman & Wunsch

              -    A    T    G    G    A
         -    0   -1   -2   -3   -4   -5
         A   -1    1    0   -1   -2   -3
         T   -2    0
         C   -3   -1
         G   -4   -2
         G   -5   -3
         A   -6   -4

Cell labels pair a prefix of ATGGA with a prefix of ATCGGA; for example, A against A scores 1 and AT against A scores 0.

Needleman & Wunsch

              -    A    T    G    G    A
         -    0   -1   -2   -3   -4   -5
         A   -1    1    0   -1   -2   -3
         T   -2    0    2    1    0   -1
         C   -3   -1    1
         G   -4   -2    0
         G   -5   -3   -1
         A   -6   -4   -2

Cell labels pair a prefix of ATGGA with a prefix of ATCGGA; for example, AT against AT scores 2 and ATG against AT scores 1.

How we build the table
Consider filling in the blank spot in pink
We have three choices
    Build on pair above, deleting char C
        ATG_
        AT_C
        Cost: 1 - 1 = 0
    Build on pair on left, inserting char G
        AT_G
        ATC_
        Cost: 1 - 1 = 0
    Match or replace, using pair from upper left
        ATG
        ATC
        Cost: 2 - 1 = 1
We only display the winner
    ATG
    ATC
    1

Excerpt of the table around the pink cell (row C, column G):

                  T                  G
        T     AT / AT : 2       ATG / AT_ : 1
        C     AT_ / ATC : 1     ATG / ATC : 1

Key Idea
To compute the best match ending at location [i,j] we compute the three values below, pick the maximal value, and store it in d[i][j]
    These values are scores (match +1), so the largest wins; with edit-distance costs we would pick the minimal value instead
The costs for match, non-match may be varied to match the problem

    insertCost = d[i-1][j] - 1;
    deleteCost = d[i][j-1] - 1;

    if (pattern[i] == text[j])
        matchCost = d[i-1][j-1] + 1;
    else
        reviseCost = d[i-1][j-1] - 1;

(Figure: the same table excerpt around the ATG / ATC cell as on the previous slide.)
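
The rule can be turned into a short table-filling routine. This is a sketch, not the lecture's code: nw_table and its parameter names are assumptions, rows follow ATCGGA and columns follow ATGGA to match the layout of the tables on these slides, and the scores are the +1 / -1 / -1 rules given earlier.

```python
def nw_table(pattern, text, match=1, mismatch=-1, gap=-1):
    """Fill the global-alignment table; rows follow text, columns follow pattern."""
    rows, cols = len(text) + 1, len(pattern) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + gap
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if text[i - 1] == pattern[j - 1] else mismatch
            d[i][j] = max(
                d[i - 1][j] + gap,       # build on the cell above
                d[i][j - 1] + gap,       # build on the cell to the left
                d[i - 1][j - 1] + step,  # match or replace from the upper left
            )
    return d


for row in nw_table("ATGGA", "ATCGGA"):
    print(" ".join(f"{v:3d}" for v in row))
```

The lower-right cell holds the score of the best global alignment, which is 4 for these two strings: five matches and one gap.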

DP Approximate Pattern Match

              -    A    T    G    G    A
         -    0   -1   -2   -3   -4   -5
         A   -1    1    0   -1   -2   -3
         T   -2    0    2    1    0   -1
         C   -3   -1    1
         G   -4   -2    0
         G   -5   -3   -1
         A   -6   -4   -2

Cells not yet computed are left blank.

Needleman & Wunsch

              -    A    T    G    G    A
         -    0   -1   -2   -3   -4   -5
         A   -1    1    0   -1   -2   -3
         T   -2    0    2    1    0   -1
         C   -3   -1    1    1    0   -1
         G   -4   -2    0    2
         G   -5   -3   -1    1
         A   -6   -4   -2    0

Cells not yet computed are left blank.

Needleman & Wunsch

              -    A    T    G    G    A
         -    0   -1   -2   -3   -4   -5
         A   -1    1    0   -1   -2   -3
         T   -2    0    2    1    0   -1
         C   -3   -1    1    1    0   -1
         G   -4   -2    0    2    2    1
         G   -5   -3   -1    1    3    2
         A   -6   -4   -2    0    2    4

Trace back from lower right

              -    A    T    G    G    A
         -    0   -1   -2   -3   -4   -5
         A   -1    1    0   -1   -2   -3
         T   -2    0    2    1    0   -1
         C   -3   -1    1    1    0   -1
         G   -4   -2    0    2    2    1
         G   -5   -3   -1    1    3    2
         A   -6   -4   -2    0    2    4
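
One way to carry out the trace back, as a sketch: starting in the lower right, step to whichever neighbor could have produced the current score, and record the pair of characters that move implies. The function name align, the tie-breaking order (diagonal first, then up, then left), and the use of "-" for a gap are choices made for this example.

```python
def align(pattern, text, match=1, mismatch=-1, gap=-1):
    """Global alignment: fill the table, then trace back from the lower right."""
    rows, cols = len(text) + 1, len(pattern) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        d[0][j] = j * gap
    for i in range(1, rows):
        d[i][0] = i * gap
        for j in range(1, cols):
            step = match if text[i - 1] == pattern[j - 1] else mismatch
            d[i][j] = max(d[i - 1][j] + gap, d[i][j - 1] + gap,
                          d[i - 1][j - 1] + step)

    top, bottom = [], []  # aligned pattern and text, built right to left
    i, j = len(text), len(pattern)
    while i > 0 or j > 0:
        step = None
        if i > 0 and j > 0:
            step = match if text[i - 1] == pattern[j - 1] else mismatch
        if step is not None and d[i][j] == d[i - 1][j - 1] + step:
            top.append(pattern[j - 1])   # came from the upper left
            bottom.append(text[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + gap:
            top.append("-")              # came from above: gap in the pattern
            bottom.append(text[i - 1])
            i -= 1
        else:
            top.append(pattern[j - 1])   # came from the left: gap in the text
            bottom.append("-")
            j -= 1
    return d[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = align("ATGGA", "ATCGGA")
print(score)   # 4
print(top)     # AT-GGA
print(bottom)  # ATCGGA
```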

Other Pricing Schemes
We may decide that alternative pricing models are better
One common assumption is that the first deletion is rare (expensive) but it is much cheaper to continue to delete
Another model suggests that a Frame Shift (delete by non-multiple of 3) is more expensive

    ATC AT- --T GGT GTT

Can we deal with such functions?
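
To make the two pricing models concrete, here is a sketch of what such cost functions might look like. Both function names and all of the numeric penalties are invented for illustration; the slides do not give values, and this shows only the cost of a single run of deletions, not an alignment algorithm that uses it.

```python
def affine_gap_cost(length, gap_open=5, gap_extend=1):
    """Cost of one run of deletions: the first is dear, each further one is cheap."""
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


def frame_aware_gap_cost(length, per_char=1, frame_shift=10):
    """Charge extra when the run length is not a multiple of 3 (a frame shift)."""
    if length == 0:
        return 0
    penalty = frame_shift if length % 3 != 0 else 0
    return per_char * length + penalty


print([affine_gap_cost(n) for n in range(5)])       # [0, 5, 6, 7, 8]
print([frame_aware_gap_cost(n) for n in range(5)])  # [0, 11, 12, 3, 14]
```

In both models the price of a deletion depends on its neighbors, so the simple three-way rule for a cell no longer has all the information it needs.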

Python File Handling
Reading a file

    """Frequency - count frequency of letters in sequence"""
    fileName = input("Enter file name: ")
    f = open(fileName, 'r')
    text = f.read()                  # read the whole file at once
    text = text.replace("\n", "")    # join the lines into one sequence

    # print("Saw ", text)

    symbolCounts = {}
    # Go over each letter in the sequence
    for x in range(len(text)):
        ...

Python File Handling, part deux
Reading a large file in one operation may be a bad idea

    """Frequency - count frequency of letters in sequence"""
    fileName = input("Enter file name: ")
    f = open(fileName, 'r')
    line = f.readline()
    while (len(line) > 0):           # readline returns "" at end of file
        line = line.replace("\n", "")
        process(line)                # handle one line at a time
        line = f.readline()

Processing
Remove \n and convert to Upper Case

    def countLetters(text, counts):
        """Count the letters in the sequence"""
        text = text.replace("\n", "")
        text = text.upper()
        for x in range(len(text)):
            ch = text[x]
            # Increment count
            if (ch in counts):
                counts[ch] = counts[ch] + 1
            else:
                counts[ch] = 1
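
For comparison, a more compact sketch of the same counting step using collections.Counter from the standard library. The name count_letters is hypothetical, and io.StringIO stands in for an open file so the example runs without one.

```python
from collections import Counter
import io


def count_letters(lines):
    """Count upper-cased letters across an iterable of lines, ignoring newlines."""
    counts = Counter()
    for line in lines:
        counts.update(line.rstrip("\n").upper())
    return counts


# A file object yields one line at a time; io.StringIO stands in for one here.
sample = io.StringIO("atgga\nATCGGA\n")
print(count_letters(sample))
```

Iterating over the file object reads one line at a time, so this also avoids loading a large file in a single operation.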

Problems
Chameleons (2.20)
How long will this thing go on? (Probabilities and the stop codon)
Finding an Open Reading Frame (ORF)
    http://www.genome.gov/25020001
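
A back-of-the-envelope sketch for the stop codon question. It assumes the four bases are equally likely and independent, which real genomes are not, and uses the three stop codons of the standard genetic code (TAA, TAG, TGA). The names P_STOP and prob_no_stop are made up for this example.

```python
P_STOP = 3 / 64  # TAA, TAG, TGA out of 64 equally likely codons (assumed)


def prob_no_stop(n_codons):
    """Probability that n_codons random codons in a row contain no stop codon."""
    return (1 - P_STOP) ** n_codons


expected_codons = 1 / P_STOP  # mean of a geometric distribution
print(round(expected_codons, 2))  # 21.33
print(prob_no_stop(100))
```

Under these assumptions a random reading frame runs for about 21 codons before hitting a stop, so a stretch of a hundred codons or more without one is unlikely to be chance.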

Open Reading Frame
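
A minimal sketch of an ORF scan, under simplifying assumptions that are not from the slides: only the three forward reading frames are searched (the reverse complement is ignored), an ORF is taken to run from the first ATG in a frame to the next in-frame stop codon, and find_orfs and min_codons are hypothetical names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return (start, end) index pairs of ATG...stop stretches in the three forward frames."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon seen in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end is exclusive, stop included
                start = None
    return orfs


print(find_orfs("CCATGAAATGACC"))  # [(2, 11)], the stretch ATGAAATGA
```

Raising min_codons filters out the short stretches that arise by chance.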

Frame Shifts

Open Reading Frame

Introns and Exons

Alternative Splicing

Problem
Hands around the world
    Everyone in the world joins hands to form a circle round the globe
    Everyone remembers who was on their right hand and left hand
    Given a flat file with triples (RightId, Id, LeftId) recover the sequence
    There are 6 billion triples, so you will want to be efficient
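
One possible approach, sketched on a toy input: index each person's left neighbor in a dictionary with a single pass over the triples, then walk the circle from any starting person. The function name recover_circle and the sample ids are invented. With 6 billion triples the dictionary would not fit in ordinary memory, so this shows the idea rather than a solution at full scale.

```python
def recover_circle(triples):
    """Rebuild the circle order from (right_id, person_id, left_id) triples."""
    left_of = {person: left for _right, person, left in triples}
    if not left_of:
        return []
    start = next(iter(left_of))  # any person will do; a circle has no first member
    order = [start]
    current = left_of[start]
    while current != start:
        order.append(current)
        current = left_of[current]
    return order


triples = [("d", "a", "b"), ("a", "b", "c"), ("b", "c", "d"), ("c", "d", "a")]
print(recover_circle(triples))  # ['a', 'b', 'c', 'd']
```

Both the indexing pass and the walk take time linear in the number of triples, compared with the quadratic cost of searching the file for each neighbor in turn.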

Next Week
Partial Digest Problem

Turnpike distances

    Exit      Description                        From NYS   From Exit 25
    1         West Stockbridge Route 41               2.9       47.4
    2         Lee US 20/Route 102                    10.6       55.1
    3         Westfield Route 10/US 202              40.4       84.9
    4         West Springfield I-91/US 5             45.7       90.2
    5         Chicopee Route 33                      49.0       93.5
    6         Springfield I-291                      51.3       95.8
    7         Ludlow Route 21                        54.9       99.4
    8         Palmer Route 32                        62.8      107.3
    9         Sturbridge I-84                        78.5      123.0
    10        Auburn I-290/I-395                     90.2      134.7
    10A       Millbury Route 146/US 20               94.1      138.6
    11        Millbury Route 122                     96.5      141.0
    11A       Hopkinton I-495                       106.2      150.7
    12        Framingham Route 9                    111.4      155.9
    13        Framingham/Natick                     116.8      161.3
    14/15     Weston I-95/Route 128                 123.3      167.8
    16        West Newton Route 16                  125.2      169.7
    17        Newton Washington/Galen               127.7      172.2
    18/20     Allston/Brighton                      130.9      175.4
    21        Back Bay Mass Ave                     132.9      177.4
    22        Copley Square MA 9                    133.4      177.9
    23        Theater District                      133.9      178.4
    24A-B-C   South Station                         134.6      179.1
    25        South Boston Local streets            135.3      179.8
    26        Airport Logan Airport                 137.3      181.8
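
The Turnpike problem starts from the distances between every pair of exits and asks for the exit positions. As a sketch of the easy direction only, the code below takes the "From NYS" mileages of the first four exits and produces the pairwise distances that the harder, reverse problem would receive as input. The name pairwise_distances is an assumption.

```python
from itertools import combinations

# "From NYS" mileages for the first four exits in the table above.
exits = {"1": 2.9, "2": 10.6, "3": 40.4, "4": 45.7}


def pairwise_distances(positions):
    """Sorted list of distances between every pair of positions."""
    return sorted(round(abs(a - b), 1) for a, b in combinations(positions, 2))


print(pairwise_distances(exits.values()))
# [5.3, 7.7, 29.8, 35.1, 37.5, 42.8]
```

Four exits give six distances; in general n points give n(n-1)/2, and the reconstruction has to work out which distance belongs to which pair.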

Summary
There is a world of interesting problems in Biology
There is great interest in finding solutions
    Computer Science can help
Crucial to keep in touch with Biologists about solutions
    Not all simplifications are equally valid
    Not all matches are meaningful
Many Biologists use the new tools in their research
    There is a need for those who understand the algorithms the tools use