1
Bioinformatics Algorithms
Lecture 2
© Jeff Parker, 2009
A bacteriologist is a man whose conversation always starts with the germ of an idea.
2
Outline
  Review
    Website changes
    Any problems? BBoard?
    Review your questions
    Review the homework questions
    Review Dynamic Programming and string match
  New material
    Reading file formats in Python
    Open Reading Frames
    Turnpike Problem
3
Questions
  What are your goals?
  Will this course be offered next year?
  What will the exams be like?
  Can you go over the probability example again?
  I didn't understand the Dynamic Programming example
  Will this get me a job?
4
(short) Answers
  What are your goals?
    See next slide
  Will this course be offered next year?
    Yes
  What will the exams be like?
    Solve this problem: suggest an algorithm
  Can you go over the probability example again?
    Yes
  I didn't understand the Dynamic Programming example
    I didn't explain it well, and combined two forms
    Let's try again tonight
  Will this get me a job?
    Not right away
5
Course Goals
  Introduce an interesting problem from Biology
  Apply some Computer Science techniques
  Our guide will be Jones and Pevzner
    I hope to cover all the chapters
    Introduce some topics that are not covered
    Add some problems in probability, which this book does not attempt to cover
  My goal is to make the course accessible to students in the Biotechnology Program
    Less focus on programming, more on algorithms
    Programming projects will be smaller, more exploratory
  Students will pick and hand in a final project on a topic of their choice
    Apply ideas discussed in class to a problem of their choice
6
Fibonacci numbers
  1, 1, 2, 3, 5, 8, 13, ...

    def fib(n):
        if (1 == n):
            return 1
        elif (2 == n):
            return 1
        else:
            return fib(n-1) + fib(n-2)

  Call tree for fib(5):

    fib(5)
      fib(4)
        fib(3)
          fib(2)
          fib(1)
        fib(2)
      fib(3)
        fib(2)
        fib(1)
7
Ruminations
  For the recursive fib(n) of slide 6:
    Not well defined for n < 1: the recursion never reaches a base case
    How many calls to compute fib(10)?
8
Running time
  The recursive fib(n) of slide 6 only returns the value 1 or a sum,
  so the result is built from one leaf call per 1:
      fib(5) = 1 + 1 + 1 + 1 + 1
  The fib(n) leaf calls are joined by fib(n) - 1 calls that add two results
  To compute fib(n) we need 2*fib(n) - 1 calls
  The call tree for fib(5) has 2*5 - 1 = 9 nodes
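The call count can be checked by instrumenting the recursion. This is a minimal sketch; `fib_calls` is a name chosen here for illustration, and it returns the number of calls alongside the value.

```python
def fib_calls(n):
    """Return (fib(n), number of calls the naive recursion makes)."""
    if n == 1 or n == 2:
        return 1, 1                      # a leaf: one call, value 1
    a, calls_a = fib_calls(n - 1)
    b, calls_b = fib_calls(n - 2)
    return a + b, calls_a + calls_b + 1  # +1 for this call itself

for n in (5, 10):
    value, calls = fib_calls(n)
    # The last two columns should agree: calls == 2*fib(n) - 1
    print(n, value, calls, 2 * value - 1)
```

For n = 10 this reports 109 calls, which matches 2*55 - 1.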
9
Dynamic Programming
    def fib3(n):
        d = {}
        # Fill the table from the bottom up
        for i in range(1, n + 1):
            if (i < 3):
                d[i] = 1
            else:
                d[i] = d[i-1] + d[i-2]
        return d[n]

  Running time? Trade-off?

  Table: 1 1 2 3 5 8 13
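One possible reading of the trade-off question: the table buys linear running time at the price of O(n) memory. Since each entry depends only on the two before it, the memory can be cut to two variables. A sketch, with `fib_iter` as a name invented here:

```python
def fib_iter(n):
    """Bottom-up Fibonacci that keeps only the last two table entries."""
    prev, curr = 1, 1              # fib(1), fib(2)
    for _ in range(n - 2):         # no iterations needed for n = 1 or 2
        prev, curr = curr, prev + curr
    return curr

print([fib_iter(n) for n in range(1, 8)])
```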
10
Top Down DP
  For some problems, we can assemble the information on the fly
  We start with an empty dictionary, d

    def fib2(n, d):
        if (n in d):
            return d[n]
        elif ((1 == n) or (2 == n)):
            result = 1
        else:
            result = fib2(n-1, d) + fib2(n-2, d)
        d[n] = result
        return result

  Table: 1 1 2 3 5 8 13
11
Top Down DP

    def fib2(n, d):
        # Show each call; keys are sorted so the trace reads in order
        print("fib2", n, dict(sorted(d.items())))
        if (n in d):
            return d[n]
        elif ((1 == n) or (2 == n)):
            result = 1
        else:
            result = fib2(n-1, d) + fib2(n-2, d)
        d[n] = result
        return result

  Trace of fib2(10, {}):

    fib2 10 {}
    fib2 9 {}
    fib2 8 {}
    fib2 7 {}
    fib2 6 {}
    fib2 5 {}
    fib2 4 {}
    fib2 3 {}
    fib2 2 {}
    fib2 1 {2: 1}
    fib2 2 {1: 1, 2: 1, 3: 2}
    fib2 3 {1: 1, 2: 1, 3: 2, 4: 3}
    fib2 4 {1: 1, 2: 1, 3: 2, 4: 3, 5: 5}
    fib2 5 {1: 1, 2: 1, 3: 2, 4: 3, 5: 5, 6: 8}
    fib2 6 {1: 1, 2: 1, 3: 2, 4: 3, 5: 5, 6: 8, 7: 13}
    fib2 7 {1: 1, 2: 1, 3: 2, 4: 3, 5: 5, 6: 8, 7: 13, 8: 21}
    fib2 8 {1: 1, 2: 1, 3: 2, 4: 3, 5: 5, 6: 8, 7: 13, 8: 21, 9: 34}
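The same top-down idea is available ready-made in the Python standard library: `functools.lru_cache` keeps the dictionary for us. This is an alternative sketch, not the version used on the slide; `fib_memo` is a name chosen here.

```python
from functools import lru_cache

@lru_cache(maxsize=None)           # unbounded cache plays the role of d
def fib_memo(n):
    if n == 1 or n == 2:
        return 1
    return fib_memo(n - 1) + fib_memo(n - 2)

print(fib_memo(10))
# Each n from 1 to 10 is computed once; every other call is a cache hit
print(fib_memo.cache_info())
```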
12
Approximate Pattern Match
  How can we define how far apart two sequences are?
  We start with the Levenshtein distance or edit distance
      ATCGG-
      AT-GGA
  Smallest number of insertions, deletions, and substitutions required
  Bill 1 for each change
  Satisfies properties for a metric, including the triangle inequality
      D(a, c) <= D(a, b) + D(b, c)
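A minimal sketch of the edit distance as a bottom-up table, billing 1 for each insertion, deletion, or substitution. The function name `edit_distance` is chosen here; only two rows of the table are kept at a time.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0 if ca == cb else 1)
            delete = prev[j] + 1            # drop ca
            insert = curr[j - 1] + 1        # add cb
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]

print(edit_distance("ATCGG", "ATGGA"))   # the pair aligned on this slide
print(edit_distance("ATGGA", "ATCGGA"))
```

The triangle inequality can be spot-checked by comparing `edit_distance(a, c)` with `edit_distance(a, b) + edit_distance(b, c)` for a few strings.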
13
Recursive Solution
  To find a match between ATGGA and ATCGGA
  Try all possible actions on first characters, then compare the rest
    Match or replace first characters of each string
    Drop first char of text
    Drop first char of pattern
  Try to match the remainder using recursion
  At each step, at least one string is shorter.

  The pair
      ATGGA
      ATCGGA
  leads to three smaller problems:
      TGGA        TGGA        ATGGA
      TCGGA       ATCGGA      TCGGA
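The three actions translate directly into a recursive function. This sketch assumes a cost of 1 per change (the edit distance pricing) and takes ATGGA as the pattern and ATCGGA as the text; `recursive_distance` is a name chosen here. It is exponential in the string lengths, so it is only suitable for short strings.

```python
def recursive_distance(pattern, text):
    """Edit distance by trying all three actions on the first characters."""
    if not pattern:
        return len(text)        # only insertions remain
    if not text:
        return len(pattern)     # only deletions remain
    # Match (cost 0) or replace (cost 1) the first characters
    match_or_replace = recursive_distance(pattern[1:], text[1:]) + (
        0 if pattern[0] == text[0] else 1)
    drop_text = recursive_distance(pattern, text[1:]) + 1
    drop_pattern = recursive_distance(pattern[1:], text) + 1
    return min(match_or_replace, drop_text, drop_pattern)

print(recursive_distance("ATGGA", "ATCGGA"))
```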
14
Dots
  List the pattern and text as row and column headings.
  Place a dot in each cell where row heading and column heading match.
  We will use this idea in other ways...
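A dot plot takes only a few lines. In this sketch the text labels the rows and the pattern labels the columns (the orientation is an assumption), matching cells are drawn as `*`, and `dot_plot` is a name chosen here.

```python
def dot_plot(pattern, text):
    """Return one string per text character; '*' marks a match with the pattern."""
    return ["".join("*" if t == p else "." for p in pattern) for t in text]

pattern, text = "ATGGA", "ATCGGA"
print("  " + pattern)
for t, row in zip(text, dot_plot(pattern, text)):
    print(t, row)
```

Runs of `*` along a diagonal correspond to stretches where the two sequences agree.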
15
Needleman & Wunsch
  Global Match
    Match all of two strings
    Notice the labeling: each cell is labeled with the two prefixes it aligns and their score
  Rules:
    Match +1
    Mismatch -1
    Insert/Delete -1

  Only the first row and first column are filled in so far:

         -    A    T    G    G    A
    -    0   -1   -2   -3   -4   -5
    A   -1
    T   -2
    C   -3
    G   -4
    G   -5
    A   -6
16
Needleman & Wunsch
  Row A and column A are now filled in:

         -    A    T    G    G    A
    -    0   -1   -2   -3   -4   -5
    A   -1    1    0   -1   -2   -3
    T   -2    0
    C   -3   -1
    G   -4   -2
    G   -5   -3
    A   -6   -4
17
Needleman & Wunsch
  Row T and column T are now filled in:

         -    A    T    G    G    A
    -    0   -1   -2   -3   -4   -5
    A   -1    1    0   -1   -2   -3
    T   -2    0    2    1    0   -1
    C   -3   -1    1
    G   -4   -2    0
    G   -5   -3   -1
    A   -6   -4   -2
18
How we build the table
  Consider filling in the blank spot in pink (row C, first column G)
  We have three choices
    Build on pair above, deleting char C
        ATG_
        AT_C
        Cost: 1 - 1 = 0
    Build on pair on left, inserting char G
        AT_G
        ATC_
        Cost: 1 - 1 = 0
    Match or replace, using pair from upper left
        ATG
        ATC
        Cost: 2 - 1 = 1
  We only display the winner

                T                  G
    T     AT  / AT    2      ATG / AT_   1
    C     AT_ / ATC   1      ATG / ATC   1
19
Key Idea
  To compute the best match ending at location [i,j] we compute the three
  values below, pick the maximal value (these are scores, not edit counts),
  and store it in d[i][j]
  The costs for match, non-match may be varied to match the problem

    insertCost = d[i-1][j] - 1
    deleteCost = d[i][j-1] - 1
    if (pattern[i] == text[j]):
        matchCost = d[i-1][j-1] + 1    # match
    else:
        matchCost = d[i-1][j-1] - 1    # replace
    d[i][j] = max(insertCost, deleteCost, matchCost)
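The update rule can be turned into a complete table fill. This is a sketch under the rules stated earlier (match +1, mismatch -1, insert/delete -1), with rows following the text ATCGGA and columns the pattern ATGGA; `nw_table` and its parameter names are chosen here for illustration.

```python
def nw_table(pattern, text, match=1, mismatch=-1, gap=-1):
    """Fill the global alignment score table; rows follow text, columns pattern."""
    rows, cols = len(text) + 1, len(pattern) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        d[0][j] = j * gap                 # pattern prefix against nothing
    for i in range(1, rows):
        d[i][0] = i * gap                 # text prefix against nothing
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if text[i - 1] == pattern[j - 1] else mismatch
            d[i][j] = max(d[i - 1][j - 1] + step,   # match or replace
                          d[i - 1][j] + gap,        # build on the cell above
                          d[i][j - 1] + gap)        # build on the cell to the left
    return d

for row in nw_table("ATGGA", "ATCGGA"):
    print(" ".join(f"{v:3d}" for v in row))
```

The value in the lower right corner is the score of the best global match.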
20
DP Approximate Pattern Match
  (. marks a cell not yet filled in)

         -    A    T    G    G    A
    -    0   -1   -2   -3   -4   -5
    A   -1    1    0   -1   -2   -3
    T   -2    0    2    1    0   -1
    C   -3   -1    1    .    .    .
    G   -4   -2    0    .    .    .
    G   -5   -3   -1    .    .    .
    A   -6   -4   -2    .    .    .
21
Needleman & Wunsch
  (. marks a cell not yet filled in)

         -    A    T    G    G    A
    -    0   -1   -2   -3   -4   -5
    A   -1    1    0   -1   -2   -3
    T   -2    0    2    1    0   -1
    C   -3   -1    1    1    0   -1
    G   -4   -2    0    2    .    .
    G   -5   -3   -1    1    .    .
    A   -6   -4   -2    0    .    .
22
Needleman & Wunsch
  The completed table:

         -    A    T    G    G    A
    -    0   -1   -2   -3   -4   -5
    A   -1    1    0   -1   -2   -3
    T   -2    0    2    1    0   -1
    C   -3   -1    1    1    0   -1
    G   -4   -2    0    2    2    1
    G   -5   -3   -1    1    3    2
    A   -6   -4   -2    0    2    4
23
Trace back from lower right

         -    A    T    G    G    A
    -    0   -1   -2   -3   -4   -5
    A   -1    1    0   -1   -2   -3
    T   -2    0    2    1    0   -1
    C   -3   -1    1    1    0   -1
    G   -4   -2    0    2    2    1
    G   -5   -3   -1    1    3    2
    A   -6   -4   -2    0    2    4
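One way to carry out the trace back: start in the lower right cell and, at each step, move to whichever neighbor could have produced the current score. The sketch below assumes the +1/-1/-1 rules and prefers the diagonal move on ties (the tie-breaking order is an arbitrary choice); `align` is a name chosen here.

```python
def align(pattern, text, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned pattern, aligned text) for a global match."""
    rows, cols = len(text) + 1, len(pattern) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        d[0][j] = j * gap
    for i in range(1, rows):
        d[i][0] = i * gap
        for j in range(1, cols):
            step = match if text[i - 1] == pattern[j - 1] else mismatch
            d[i][j] = max(d[i - 1][j - 1] + step,
                          d[i - 1][j] + gap,
                          d[i][j - 1] + gap)

    # Walk back from the lower right corner to the upper left
    top, bottom = [], []
    i, j = len(text), len(pattern)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if text[i - 1] == pattern[j - 1] else mismatch
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + step:
            top.append(pattern[j - 1]); bottom.append(text[i - 1])
            i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + gap:
            top.append("-"); bottom.append(text[i - 1])   # gap in the pattern
            i -= 1
        else:
            top.append(pattern[j - 1]); bottom.append("-")  # gap in the text
            j -= 1
    return d[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))

score, a, b = align("ATGGA", "ATCGGA")
print(score)
print(a)
print(b)
```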
24
Other Pricing Schemes
  We may decide that alternative pricing models are better
  One common assumption is that the first deletion is rare (expensive) but it is much cheaper to continue to delete
  Another model suggests that a Frame Shift (delete by non-multiple of 3) is more expensive

      ATC AT- --T GGT GTT

  Can we deal with such functions?
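The two pricing ideas can at least be written down as functions of the gap length. This sketch only prices a single gap; it does not show how to fold the function into the table fill. The function names and all penalty values are made-up illustrations, not values from the lecture.

```python
def affine_gap_cost(length, gap_open=-5, gap_extend=-1):
    """First deleted character is expensive; each further one is cheap."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend

def frame_shift_gap_cost(length, gap_open=-5, gap_extend=-1, frame_shift=-10):
    """Affine cost plus an extra penalty when the gap is not a multiple of 3."""
    cost = affine_gap_cost(length, gap_open, gap_extend)
    if length > 0 and length % 3 != 0:
        cost += frame_shift
    return cost

for k in range(1, 7):
    print(k, affine_gap_cost(k), frame_shift_gap_cost(k))
```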
25
Python File Handling
  Reading a file

    """Frequency: count frequency of letters in sequence"""
    fileName = input("Enter file name: ")
    f = open(fileName, 'r')
    text = f.read()
    f.close()
    text = text.replace("\n", "")

    # print("Saw ", text)

    symbolCounts = {}
    # Go over each letter in the sequence
    for x in range(len(text)):
        ch = text[x]
        # Increment count (as in countLetters)
        symbolCounts[ch] = symbolCounts.get(ch, 0) + 1
26
Python File Handling, part deux
  Reading a large file in one operation may be a bad idea

    """Frequency: count frequency of letters in sequence"""
    fileName = input("Enter file name: ")
    f = open(fileName, 'r')
    line = f.readline()
    # readline() returns an empty string at end of file
    while (len(line) > 0):
        line = line.replace("\n", "")
        process(line)    # whatever work is needed on each line
        line = f.readline()
    f.close()
27
Processing
  Remove \n and convert to Upper Case

    def countLetters(text, counts):
        """Count the letters in the sequence"""
        text = text.replace("\n", "")
        text = text.upper()
        for x in range(len(text)):
            ch = text[x]
            # Increment count
            if (ch in counts):
                counts[ch] = counts[ch] + 1
            else:
                counts[ch] = 1
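The standard library offers a shortcut for the same job: `collections.Counter` does the dictionary bookkeeping. A sketch of an alternative, with `count_letters` as a name chosen here:

```python
from collections import Counter

def count_letters(text):
    """Count letters after removing newlines and converting to upper case."""
    return Counter(text.replace("\n", "").upper())

counts = count_letters("atg\nGGa")
for letter in sorted(counts):
    print(letter, counts[letter])
```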
28
Problems
  Chameleons (2.20)
  How long will this thing go on? (Probabilities and the stop codon)
  Finding an Open Reading Frame (ORF)
      http://www.genome.gov/25020001
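A rough sketch for the last two problems. If we assume the four bases are independent and equally likely (a simplification), 3 of the 64 codons are stop codons, so the expected wait for a stop is 64/3, about 21 codons. The ORF scan below looks only at the three forward reading frames and treats an ORF as an ATG followed in frame by the first stop codon; `find_orfs` and `min_codons` are names chosen here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Expected number of codons until a stop, for uniform random bases
EXPECTED_CODONS_UNTIL_STOP = 64 / 3

def find_orfs(seq, min_codons=1):
    """Return (start, end) slices of ATG...stop stretches in the forward frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                          # open a reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))    # include the stop codon
                start = None
    return orfs

seq = "CCATGAAATGACC"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
print(round(EXPECTED_CODONS_UNTIL_STOP, 1))
```

A fuller treatment would also scan the reverse complement strand.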
29
Open Reading Frame
30
Frame Shifts
31
Open Reading Frame
32
Introns and Exons
33
Alternative Splicing
34
Problem
  Hands around the world
    Everyone in the world joins hands to form a circle round the globe
    Everyone remembers who was on their right hand and left hand
    Given a flat file with triples (RightId, Id, LeftId) recover the sequence
    There are 6 billion triples, so you will want to be efficient
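One possible approach, sketched for a small example: index the triples by Id in a dictionary, then follow the LeftId links around the circle. This takes time linear in the number of triples, but it assumes the whole index fits in memory, which may not hold for 6 billion records. `recover_circle` is a name chosen here.

```python
def recover_circle(triples):
    """Rebuild the circle from (RightId, Id, LeftId) triples in any order."""
    left_of = {person: left for _right, person, left in triples}
    if not left_of:
        return []
    start = next(iter(left_of))        # any person will do as a starting point
    order = [start]
    current = left_of[start]
    while current != start:            # stop when the circle closes
        order.append(current)
        current = left_of[current]
    return order

# A circle a -> b -> c -> d -> a, with the triples shuffled
triples = [("b", "c", "d"), ("d", "a", "b"), ("c", "d", "a"), ("a", "b", "c")]
print(recover_circle(triples))
```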
35
Next Week
  Partial Digest Problem
36
Turnpike distances

  Exit      Description                         From NYS   From Exit 25
  1         West Stockbridge Route 41                2.9           47.4
  2         Lee US 20/Route 102                     10.6           55.1
  3         Westfield Route 10/US 202               40.4           84.9
  4         West Springfield I-91/US 5              45.7           90.2
  5         Chicopee Route 33                       49.0           93.5
  6         Springfield I-291                       51.3           95.8
  7         Ludlow Route 21                         54.9           99.4
  8         Palmer Route 32                         62.8          107.3
  9         Sturbridge I-84                         78.5          123.0
  10        Auburn I-290/I-395                      90.2          134.7
  10A       Millbury Route 146/US 20                94.1          138.6
  11        Millbury Route 122                      96.5          141.0
  11A       Hopkinton I-495                        106.2          150.7
  12        Framingham Route 9                     111.4          155.9
  13        Framingham/Natick                      116.8          161.3
  14/15     Weston I-95/Route 128                  123.3          167.8
  16        West Newton Route 16                   125.2          169.7
  17        Newton Washington/ Galen               127.7          172.2
  18/20     Allston/Brighton                       130.9          175.4
  21        Back Bay Mass Ave                      132.9          177.4
  22        Copley Square MA 9                     133.4          177.9
  23        Theater District                       133.9          178.4
  24A-B-C   South Station                          134.6          179.1
  25        South Boston Local streets.            135.3          179.8
  26        Airport Logan Airport                  137.3          181.8
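Going from exit positions to the list of all pairwise distances is the easy direction; the Turnpike (Partial Digest) Problem asks for the reverse. A sketch of the easy direction using the first few "From NYS" values in the table; `pairwise_distances` is a name chosen here, and distances are rounded to one decimal to hide floating point noise.

```python
from itertools import combinations

def pairwise_distances(positions):
    """Return the sorted list of distances between every pair of positions."""
    return sorted(round(abs(b - a), 1) for a, b in combinations(positions, 2))

exits = [2.9, 10.6, 40.4, 45.7]     # exits 1 to 4, miles from the NYS line
print(pairwise_distances(exits))
```

With n exits there are n(n-1)/2 distances, so the 25 rows in the table give 300 of them.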
37
Summary
  There is a world of interesting problems in Biology
  There is great interest in finding solutions
    Computer Science can help
    Crucial to keep in touch with Biologists about solutions
      Not all simplifications are equally valid
      Not all matches are meaningful
  Many Biologists use the new tools in their research
  There is a need for those who understand the algorithms the tools use