Computational Biology, Part A
More on Sequence Operations
Robert F. Murphy
Copyright 1997, 2001.
All rights reserved.
Representation of Sequences
  characters
    simplest
    easy to read, edit, etc.
  bit-coding
    more compact, both on disk and in memory
    comparisons more efficient
Matching one character - with character variables
  Assume two character variables "C" and "Q"
  test for exact match
    If(Q=C) {...}
  need complicated statements to handle wildcards
    If(Q=C | (Q='A' & (C='A' | C='R' | C='W' | C='M' | C='D' | C='H' | C='V' | C='N') | Q='C' & ...)) {...}
  can build into a function
    If(TestBase(Q,C)) {...}
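The slide names TestBase but does not show its body. A minimal sketch of one way to write it in C follows. The helper iub_bases and the rule "two codes match when their base sets overlap" are assumptions made for illustration; the rule reduces to the condition on the slide when Q is a definite base.

```c
#include <string.h>

/* Hypothetical helper: the bases allowed by one IUB code,
   or NULL when the character is not a recognized code. */
static const char *iub_bases(char code)
{
    switch (code) {
    case 'A': return "A";
    case 'C': return "C";
    case 'G': return "G";
    case 'T': return "T";
    case 'R': return "AG";
    case 'Y': return "CT";
    case 'S': return "CG";
    case 'W': return "AT";
    case 'M': return "AC";
    case 'K': return "GT";
    case 'B': return "CGT";
    case 'D': return "AGT";
    case 'H': return "ACT";
    case 'V': return "ACG";
    case 'N': return "ACGT";
    default:  return NULL;
    }
}

/* Returns 1 if Q and C have at least one allowed base in common, else 0.
   Unrecognized characters never match (an assumption of this sketch). */
int TestBase(char Q, char C)
{
    const char *qb = iub_bases(Q);
    const char *cb = iub_bases(C);

    if (qb == NULL || cb == NULL)
        return 0;
    for (; *qb != '\0'; qb++)
        if (strchr(cb, *qb) != NULL)
            return 1;
    return 0;
}
```

With this sketch, TestBase('A','R') is true because R allows A, and TestBase('A','Y') is false because Y allows only C or T.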
Efficient method to match one character
  Convert char to int 0-25
  Create 26x26 matrix showing which matches which
  Lookup two characters to be compared to find value
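The lookup-table idea can be sketched as below. The names match_table, build_match_table and lookup_match are hypothetical, the code assumes upper-case letters in a contiguous character set such as ASCII, and "matches" is taken to mean that the two codes share at least one base.

```c
#include <string.h>

/* match_table[q][c] is 1 when letters q and c (as 0-25 indices) can match. */
static unsigned char match_table[26][26];

/* Fill the table once; letters that are not IUB codes stay 0. */
void build_match_table(void)
{
    static const char codes[] = "ACGTRYSWMKBDHVN";
    static const char *const bases[] = {
        "A", "C", "G", "T", "AG", "CT", "CG", "AT", "AC", "GT",
        "CGT", "AGT", "ACT", "ACG", "ACGT"
    };
    int i, j;
    const char *p;

    memset(match_table, 0, sizeof match_table);
    for (i = 0; codes[i] != '\0'; i++) {
        for (j = 0; codes[j] != '\0'; j++) {
            /* Mark the pair if any base of code i is also allowed by code j. */
            for (p = bases[i]; *p != '\0'; p++) {
                if (strchr(bases[j], *p) != NULL) {
                    match_table[codes[i] - 'A'][codes[j] - 'A'] = 1;
                    break;
                }
            }
        }
    }
}

/* Compare two characters: convert each to 0-25 and look up the answer. */
int lookup_match(char q, char c)
{
    if (q < 'A' || q > 'Z' || c < 'A' || c > 'Z')
        return 0;
    return match_table[q - 'A'][c - 'A'];
}
```

The cost of building the table is paid once; every later comparison is two subtractions and one array access.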
Bit-coding
  let the following binary values represent each base
    A = 0001
    C = 0010
    G = 0100
    T = 1000
  then
    G = 4
    A or C = 0011 = 3
    A, G or T = 1101 = 13
    etc.
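As a sketch of how the bit values above extend to the full set of IUB codes, the function below ORs together the bits of every base a code allows. The name encode_base and the BASE_ constants are hypothetical; returning 0 for an unrecognized character is an assumption of this example.

```c
/* One bit per base, as on the slide. */
enum { BASE_A = 1, BASE_C = 2, BASE_G = 4, BASE_T = 8 };

/* Returns the 4-bit mask for an IUB code, or 0 for an unknown character. */
unsigned encode_base(char code)
{
    switch (code) {
    case 'A': return BASE_A;
    case 'C': return BASE_C;
    case 'G': return BASE_G;
    case 'T': return BASE_T;
    case 'R': return BASE_A | BASE_G;
    case 'Y': return BASE_C | BASE_T;
    case 'S': return BASE_C | BASE_G;
    case 'W': return BASE_A | BASE_T;
    case 'M': return BASE_A | BASE_C;
    case 'K': return BASE_G | BASE_T;
    case 'B': return BASE_C | BASE_G | BASE_T;   /* not A */
    case 'D': return BASE_A | BASE_G | BASE_T;   /* not C */
    case 'H': return BASE_A | BASE_C | BASE_T;   /* not G */
    case 'V': return BASE_A | BASE_C | BASE_G;   /* not T */
    case 'N': return BASE_A | BASE_C | BASE_G | BASE_T;
    default:  return 0;
    }
}
```

This reproduces the slide's values: G encodes to 4, M (A or C) to 3, and D (A, G or T) to 13.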
Matching one character - with bit coding
  Assume two integer variables "I" and "J"
  test for exact match
    If(I=J) {...}
  test for match with wildcards (no lookup!)
    If(I&J) {...}
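In C the two tests might look like the sketch below. The slide's pseudocode writes the exact test with a single "=", which in C would be "=="; the function names and BIT_ constants are hypothetical.

```c
/* One bit per base: A=0001, C=0010, G=0100, T=1000. */
enum { BIT_A = 1, BIT_C = 2, BIT_G = 4, BIT_T = 8 };

/* Exact match: both codes allow exactly the same set of bases. */
int exact_match(unsigned i, unsigned j)
{
    return i == j;
}

/* Wildcard match: the two codes share at least one base,
   so their bitwise AND is non-zero. */
int wildcard_match(unsigned i, unsigned j)
{
    return (i & j) != 0;
}
```

For example, A (0001) against R (A or G, 0101) fails the exact test but passes the wildcard test, while C (0010) against R fails both.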
Matching more than one character - pattern matching
  Example: recognition site for a restriction enzyme
  Input sequence string into variable Seq
  Define Site as string of characters or masks
    EcoRI recognizes GAATTC
    AccI recognizes GTMKAC
  Create function to search a sequence for that site
    Find(Site,LenSite,Seq,LenSeq)
    for each position in Seq, see if Site matches starting there
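A minimal sketch of Find, using the argument list from the slide. The return convention (0-based index of the first match, or -1 if there is none) and the helper mask_of are assumptions made for this example; the slide does not specify them.

```c
#include <string.h>

/* Hypothetical helper: 4-bit mask (A=1, C=2, G=4, T=8) for an IUB code,
   or 0 when the character is not a recognized code. */
static unsigned mask_of(char code)
{
    static const char codes[] = "ACGTRYSWMKBDHVN";
    static const unsigned masks[] = {
        1, 2, 4, 8, 5, 10, 6, 9, 3, 12, 14, 13, 11, 7, 15
    };
    const char *p = (code != '\0') ? strchr(codes, code) : NULL;

    return (p != NULL) ? masks[p - codes] : 0u;
}

/* For each position in Seq, see if Site matches starting there.
   Returns the first matching position (0-based) or -1. */
int Find(const char *Site, int LenSite, const char *Seq, int LenSeq)
{
    int start, k;

    for (start = 0; start + LenSite <= LenSeq; start++) {
        for (k = 0; k < LenSite; k++)
            if ((mask_of(Site[k]) & mask_of(Seq[start + k])) == 0)
                break;              /* mismatch at offset k */
        if (k == LenSite)
            return start;           /* every position matched */
    }
    return -1;
}
```

With this sketch, Find("GAATTC", 6, "TTGAATTCAA", 10) reports position 2, and the AccI site GTMKAC is found at position 2 of AAGTATACGG because M allows A and K allows T.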
Automating Probability Calculations using Nucleotide Frequencies
Automating the Calculation
  Goal: Calculate probability of occurrence of a sequence that may include ambiguous bases
  What we need is a way to consider all possible allowed nucleotides at each position in all allowed combinations
  When using dinucleotide probabilities, have to be careful about how the probabilities are combined
Illustration
  Question: What is the probability of observing sequence feature ART (A followed by a purine {either A or G}, followed by a T) using dinucleotide probabilities?
Which is right?
  $p_{ART} = p_A (p^*_{AA} + p^*_{AG})(p^*_{AT} + p^*_{GT})$   [eq. 1]
  $p_{ART} = p_A (p^*_{AA} p^*_{AT} + p^*_{AG} p^*_{GT})$   [eq. 2]
Expansions
  $p_{ART} = p_A (p^*_{AA} + p^*_{AG})(p^*_{AT} + p^*_{GT})$   [eq. 1]
  $p_{ART} = p_A p^*_{AA} p^*_{AT} + p_A p^*_{AA} p^*_{GT} + p_A p^*_{AG} p^*_{AT} + p_A p^*_{AG} p^*_{GT}$
  $p_{ART} = p_A (p^*_{AA} p^*_{AT} + p^*_{AG} p^*_{GT})$   [eq. 2]
  $p_{ART} = p_A p^*_{AA} p^*_{AT} + p_A p^*_{AG} p^*_{GT}$
Proof
  $p_{ART} = p_{AAT} + p_{AGT}$
  $p_{AAT} = p_A p^*_{AA} p^*_{AT}$
  $p_{AGT} = p_A p^*_{AG} p^*_{GT}$
  $p_{ART} = p_A p^*_{AA} p^*_{AT} + p_A p^*_{AG} p^*_{GT}$
  This matches equation 2 on previous slide
Need further convincing?
  Imagine that $p^*_{AA} = 0$ and $p^*_{GT} = 0$ (but all other $p^*$ are non-zero)
  Then $p_{ART}$ should be zero since there is no way to create either AAT or AGT
  This is predicted by eq. 2 but not by eq. 1
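The thought experiment can be checked numerically. In the sketch below the function names eq1 and eq2 are hypothetical, and the probability values suggested in the note after the code are made up purely for illustration; only the zeros for p*_AA and p*_GT come from the slide.

```c
/* eq. 1: add at each position, then multiply the sums. */
double eq1(double pA, double pAA, double pAG, double pAT, double pGT)
{
    return pA * (pAA + pAG) * (pAT + pGT);
}

/* eq. 2: multiply along each allowed path (AAT, AGT), then add. */
double eq2(double pA, double pAA, double pAG, double pAT, double pGT)
{
    return pA * (pAA * pAT + pAG * pGT);
}
```

Calling both with pA = 0.25, pAA = 0, pAG = 0.3, pAT = 0.2 and pGT = 0 (illustrative values) gives exactly zero from eq2, while eq1 returns a positive value because it includes the cross term p_A p*_AG p*_AT, which corresponds to no real sequence.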
More complicated probability illustration
  What is the probability of observing the sequence feature ARYT (A followed by a purine {either A or G}, followed by a pyrimidine {either C or T}, followed by a T)?
  Using equal mononucleotide frequencies
    $p_A = p_C = p_G = p_T = 1/4$
    $p_{ARYT} = 1/4 \times (1/4 + 1/4) \times (1/4 + 1/4) \times 1/4 = 1/64$
Illustration (continued)
  Using observed mononucleotide frequencies:
    $p_{ARYT} = p_A (p_A + p_G)(p_C + p_T) p_T$
  Using dinucleotide frequencies:
    $p_{ARYT} = p_A (p^*_{AA} (p^*_{AC} p^*_{CT} + p^*_{AT} p^*_{TT}) + p^*_{AG} (p^*_{GC} p^*_{CT} + p^*_{GT} p^*_{TT}))$
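The dinucleotide formula for ARYT is a sum over four paths (AACT, AATT, AGCT, AGTT). A sketch that writes this out with two explicit loops follows. The function name prob_ARYT and the array layout are hypothetical, and di[x][y] is treated as the probability that base y follows base x, which is how the p* terms are used in the equations above.

```c
/* Indices of the four bases in the probability arrays (assumed layout). */
enum { A, C, G, T };

/* mono[x]  : probability of base x at the first position
   di[x][y] : probability that y follows x (the slides' p*_xy) */
double prob_ARYT(const double mono[4], double di[4][4])
{
    static const int purines[2] = { A, G };
    static const int pyrimidines[2] = { C, T };
    double total = 0.0;
    int i, j;

    for (i = 0; i < 2; i++) {
        for (j = 0; j < 2; j++) {
            int r = purines[i];
            int y = pyrimidines[j];
            /* Multiply along one allowed path, then add it to the total. */
            total += mono[A] * di[A][r] * di[r][y] * di[y][T];
        }
    }
    return total;
}
```

When every entry of mono and di is 1/4, each of the four paths contributes 1/256 and the total is 1/64, in agreement with the equal-frequency result.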
Illustration (continued)
  Using dinucleotide frequencies:
  [Figure: tree of allowed paths for A R Y T. A branches to AA and AG; these branch to AAC, AAT, AGC and AGT; adding the final T gives AACT, AATT, AGCT and AGTT.]
Multiply then add
  We conclude that for such strings our rule should be "multiply dinucleotide probabilities along each allowed path and then add the results"
How do we program this?
  "for" loops?
  Nested "if" structure?
  Other?
Will this work?
  result=monoprob(seq(1));
  for i=2 to n
  {
    temp=0.
    for j=1 to 4 /*for each base*/
    {
      if(seq(i)&mask(j)) temp=temp+diprob(seq(i-1),seq(i))
    }
    result=result*temp
  }
A recursive solution
  Some programming languages allow recursion - the calling (invoking) of a function by itself
  This is useful here because we can branch when we encounter an ambiguous base and consider all alternatives separately
  Allows multiplication down the branches and then addition
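One way the recursive idea could be turned into a probability calculation is sketched below. This is not the course's demo program: the names site_probability, extend and mask_of are hypothetical, di[x][y] is assumed to be the probability that base y follows base x, bases are indexed A=0, C=1, G=2, T=3, and an unrecognized code is assumed to contribute probability 0.

```c
#include <string.h>

/* Hypothetical helper: 4-bit mask (A=1, C=2, G=4, T=8) for an IUB code,
   or 0 when the character is not a recognized code. */
static unsigned mask_of(char code)
{
    static const char codes[] = "ACGTRYSWMKBDHVN";
    static const unsigned masks[] = {
        1, 2, 4, 8, 5, 10, 6, 9, 3, 12, 14, 13, 11, 7, 15
    };
    const char *p = (code != '\0') ? strchr(codes, code) : NULL;

    return (p != NULL) ? masks[p - codes] : 0u;
}

/* Probability of the rest of the site, given that the previous
   position was the definite base prev (0..3). */
static double extend(const char *site, int prev, double di[4][4])
{
    unsigned allowed;
    double total = 0.0;
    int b;

    if (*site == '\0')
        return 1.0;                       /* end of site: unwind */
    allowed = mask_of(*site);
    for (b = 0; b < 4; b++)
        if (allowed & (1u << b))          /* branch on each allowed base */
            total += di[prev][b] * extend(site + 1, b, di);
    return total;
}

/* Multiply down each branch, add across branches. */
double site_probability(const char *site, const double mono[4],
                        double di[4][4])
{
    unsigned allowed;
    double total = 0.0;
    int b;

    if (*site == '\0')
        return 1.0;
    allowed = mask_of(*site);
    for (b = 0; b < 4; b++)
        if (allowed & (1u << b))
            total += mono[b] * extend(site + 1, b, di);
    return total;
}
```

Because each branch carries its own previous base into the next call, the products are formed along individual paths before anything is summed, which is the "multiply then add" rule. With all probabilities set to 1/4, ARYT evaluates to 1/64; setting di[A][A] and di[G][T] to zero makes ART evaluate to zero, as eq. 2 requires.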
Site Probability Calculation via Recursion
  Illustration: Make a function that prints out all possible sequences that can match a restriction site
  (Demo Program PossibleSites.c)
  (found in /afs/andrew.cmu.edu/usr/murphy/CompBiol/DemoPrograms or Mellon: BioServer: Comp. Biol. 03-310: Demo Programs: PossibleSites ƒ)
PossibleSites.c
/* PossibleSites.c

   Prints out all possible sites that can match a string of IUB codes

   January 22, 1997 - R.F. Murphy
*/

#include <stdbool.h>   /* true, false */
#include <stdio.h>
#include <string.h>

void PossibleSites(char SiteString[], int Index);
short Test1(char SiteString[], int Index);
short Test2(char SiteString[], int Index);
short Test3(char SiteString[], int Index);
short Test4(char SiteString[], int Index);

int main(void)
{
    char Site[11];   /* up to 10 IUB codes plus the terminating '\0' */

    for (;;) {
        printf("Enter a string of IUB codes (up to 10 characters): ");
        if (scanf("%10s", Site) != 1)
            break;   /* stop at end of input */
        PossibleSites(Site, 0);
    }
    return 0;
}
void PossibleSites(char SiteString[], int Index)
{
    if ((size_t)Index >= strlen(SiteString)) {
        /* Every position has been resolved: print the expanded site */
        printf("%s\n", SiteString);
        return;
    }
    else {
        /* Try each class of IUB code; a test that matches does the recursion */
        if (Test1(SiteString, Index)) ;
        else if (Test2(SiteString, Index)) ;
        else if (Test3(SiteString, Index)) ;
        else if (Test4(SiteString, Index)) ;
        else {
            printf("Illegal character (%c) encountered\n", SiteString[Index]);
            PossibleSites(SiteString, Index+1);
        }
    }
    return;
}
short Test1(char SiteString[], int Index)
{
    /* printf("In Test1: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */

    /* Definite base: nothing to expand, go on to the next position */
    switch (SiteString[Index]) {
    case 'A':
    case 'C':
    case 'G':
    case 'T':
        break;
    default:
        return false;
    }
    PossibleSites(SiteString, Index+1);
    return true;
}
/* Slide callouts on the code: "Unwind here"; "Test for each type of ambiguous base" */
short Test2(char SiteString[], int Index)
{
    char Save;

    /* printf("In Test2: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */

    /* Two-base codes: substitute each allowed base in turn, then restore */
    Save = SiteString[Index];
    switch (SiteString[Index]) {
    case 'R':
        SiteString[Index]='A'; PossibleSites(SiteString, Index);
        SiteString[Index]='G'; PossibleSites(SiteString, Index);
        break;
    case 'Y':
        SiteString[Index]='C'; PossibleSites(SiteString, Index);
        SiteString[Index]='T'; PossibleSites(SiteString, Index);
        break;
    case 'S':
        SiteString[Index]='G'; PossibleSites(SiteString, Index);
        SiteString[Index]='C'; PossibleSites(SiteString, Index);
        break;
    case 'W':
        SiteString[Index]='A'; PossibleSites(SiteString, Index);
        SiteString[Index]='T'; PossibleSites(SiteString, Index);
        break;
    case 'M':
        SiteString[Index]='A'; PossibleSites(SiteString, Index);
        SiteString[Index]='C'; PossibleSites(SiteString, Index);
        break;
    case 'K':
        SiteString[Index]='G'; PossibleSites(SiteString, Index);
        SiteString[Index]='T'; PossibleSites(SiteString, Index);
        break;
    default:
        return false;
    }
    SiteString[Index] = Save;
    return true;
}
short Test3(char SiteString[], int Index)
{
    char Save;

    /* printf("In Test3: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */

    /* Three-base codes: substitute each allowed base in turn, then restore */
    Save = SiteString[Index];
    switch (SiteString[Index]) {
    case 'B': /* not A */
        SiteString[Index]='C'; PossibleSites(SiteString, Index);
        SiteString[Index]='G'; PossibleSites(SiteString, Index);
        SiteString[Index]='T'; PossibleSites(SiteString, Index);
        break;
    case 'D': /* not C */
        SiteString[Index]='A'; PossibleSites(SiteString, Index);
        SiteString[Index]='G'; PossibleSites(SiteString, Index);
        SiteString[Index]='T'; PossibleSites(SiteString, Index);
        break;
    case 'H': /* not G */
        SiteString[Index]='A'; PossibleSites(SiteString, Index);
        SiteString[Index]='C'; PossibleSites(SiteString, Index);
        SiteString[Index]='T'; PossibleSites(SiteString, Index);
        break;
    case 'V': /* not T/U */
        SiteString[Index]='A'; PossibleSites(SiteString, Index);
        SiteString[Index]='C'; PossibleSites(SiteString, Index);
        SiteString[Index]='G'; PossibleSites(SiteString, Index);
        break;
    default:
        return false;
    }
    SiteString[Index] = Save;
    return true;
}
short Test4(char SiteString[], int Index)
{
    char Save;

    /* printf("In Test4: Index %d, SiteString[Index] %c\n",Index,SiteString[Index]); */

    /* Any-base codes: substitute all four bases in turn, then restore */
    Save = SiteString[Index];
    switch (SiteString[Index]) {
    case 'N': /* A,C,G,T/U (iNdeterminate) */
    case 'X': /* alternate for N */
        SiteString[Index]='A'; PossibleSites(SiteString, Index);
        SiteString[Index]='C'; PossibleSites(SiteString, Index);
        SiteString[Index]='G'; PossibleSites(SiteString, Index);
        SiteString[Index]='T'; PossibleSites(SiteString, Index);
        break;
    default:
        return false;
    }
    SiteString[Index] = Save;
    return true;
}