kyle jensen's mit ph.d. thesis proposal
DESCRIPTION
This is the presentation I gave for my thesis proposal, sometime in 2001. Obviously, almost all of these ideas failed miserably!
TRANSCRIPT
K.L. JENSEN, 20-Dec-01
BIOINFORMATICS AND METABOLIC ENGINEERING LABORATORY AT MIT
SLIDE 1/25
Syntactic Pattern Discovery as a Generic Tool in Systems Biology
Kyle L. Jensen, 20 December 2001
Or: How I learned to stop worrying and love biology.
SLIDE 2/25
Outline
• Introduction
  – Pattern Discovery
  – Teiresias
• Proposed Problems
  – Biological Sequences
  – Gene Expression and Physiological Data
• Work to Date
  – Protein Evolution and Scoring Matrices
SLIDE 3/25
Part I: Introduction
SLIDE 4/25
Pattern Discovery
• Decision-Theoretic
• Syntactic
primitive streams:
integers - 0 12 13 0 2 1 7 8 9 10
characters - ABCDEF
amino acids - MSKNIVLLPGDHVGPEVVA
nucleotides - ATGAGCATCGATCGATCGAATCTA
Basic Question: When are two events the same?
patterns:
V[HDV].[ST]K
12 . . 1 . 7
TCGATCGA
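The character patterns on this slide use the same notation as regular expressions: "." is a one-symbol wildcard and "[ST]" means "S or T". A minimal sketch of matching such a pattern against a primitive stream follows; the helper name find_all and the short test string are illustrative assumptions, not part of the original talk.

```python
import re

def find_all(pattern, stream):
    """Return the start offset of every match, including overlapping ones."""
    # A zero-width lookahead lets the scan report overlapping occurrences.
    return [m.start() for m in re.finditer(f"(?=(?:{pattern}))", stream)]

# Nucleotide stream and pattern taken from the slide.
nucleotides = "ATGAGCATCGATCGATCGAATCTA"
print(find_all("TCGATCGA", nucleotides))

# Bracket and wildcard pattern from the slide, on a made-up string.
print(find_all("V[HDV].[ST]K", "AVHASKB"))
```

The integer pattern "12 . . 1 . 7" would need a tokenized stream rather than a character string, so it is left out of this sketch.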
SLIDE 5/25
A Little History
• Formal language theory – pattern recognition
  primitives: a b c d e
  submedian: babcbabdacad
  telocentric: ebabcbab
• Biological sequence analysis
  – Teiresias, Blocks, Emotif, AlignACE, Prosite…
  – Discovery: functional, structural, classification
  Teiresias: H.....HRD.K..N (serine kinase)
  AlignACE: ATCATACTATACGA (yeast promoter)
  Prosite: RP[VI]ILDPx[DE]PT (family classifier)
SLIDE 6/25
An Illustrative Example
• Patterns in sequences
Given sequences:
recnacotenummiylbaborpsikcaj • recnacotelbitpecsusylbaborpsinhoj • retalsetebaidpolevedylbaborplliwllij
dabsirecnac • elbirrohsaweugalp • noosrecnacevahotylekilsikcaj • retalsetebaiddlimdepolevedllij
eugalpdlimotelbitpecsusylbaborpsinhoj • wolebtonlliwesnopserenummina • retalpolevedylbaborplliwsetebaid
noosrecnacpolevedylekillliwllij • eugalpdabevahlliwnhojretal • setebaidotelbitpecsussawnhoj
polevedtonlliwesnopserenummina • enajnipolevedotylekilsirecnac • eugalppolevedlliwkcaj
recnacotenummisienaj • setebaidotelbitpecsusebnooslliwkcaj • eugalppolevedlliwylbaborpkcaj
elbirrohsirecnac • ylekiltonsiretalesnopserenummina • setebaidotelbitpecsussinhoj
recnacpolevedylekilnooslliwnhoj • ylekilebtonlliwsetebaid • tceffenaevahtonlliwrecnac
eugalpotenummisillij • elbirroheblliwesnopsereht
Strings with 4+ chars occurring 3+ times:
lliw, recnac, poleved, elbi, ylbaborp, enummi, eugalp, setebaid, ylekil, otelbitpecsus, kcaj, nhoj, ylbaborpsi, llij, esnopsere, noos, retal, esnopserenummina, sire, polevedyl, recnacote, tonlliw, otenummi, otelbitpecsusylbaborpsi, sikcaj, sirecnac, polevedlli, lliwesnopsere, otylekilsi, setebaidotelbitpecsus, wnhoj, evah, alpo, sinhoj, elbirroh
…things that occur many times…
Read right to left, the sequences and strings become:
jackisprobablyimmunetocancer • johnisprobablysusceptibletocancer • marywillprobablydevelopdiabeteslater
cancerisbad • plaguewashorrible • jackislikelytohavecancersoon • marydevelopedmilddiabeteslater
johnisprobablysusceptibletomildplague • animmuneresponsewillnotbelow • diabeteswillprobablydeveloplater
marywilllikelydevelopcancersoon • laterjohnwillhavebadplague • maryisprobablysusceptibletocancer
animmuneresponseislikelytodevelopsoon • jackisprobablyimmunetoplague • johnwassusceptibletodiabetes
animmuneresponsewillnotdevelop • cancerislikelytodevelopinjane • jackwilldevelopplague • janeisimmunetocancer
jackwillsoonbesusceptibletodiabetes • jackprobablywilldevelopplague • cancerishorrible
animmuneresponselaterisnotlikely • johnissusceptibletodiabetes • johnwillsoonlikelydevelopcancer
diabeteswillnotbelikely • cancerwillnothaveaneffect • maryisimmunetoplague • theresponsewillbehorrible
will, cancer, develop, ible, probably, immune, plague, diabetes, likely, susceptibleto, jack, john, isprobably, jill, eresponse, soon, later, animmuneresponse, eris, lydevelop, etocancer, willnot, immuneto, isprobablysusceptibleto, jackis, canceris, illdevelop, eresponsewill, islikelyto, susceptibletodiabetes, johnw, have, opla, johnis, horrible
John is probably susceptible to cancer.
…find important features…
…but, what is “important”…
How do we know these are important?
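The "strings with 4+ chars occurring 3+ times" idea can be sketched by brute force. This is only an illustration of the counting criterion, not the algorithm used in the talk; the function names and the three-sequence sample are assumptions chosen for brevity.

```python
from collections import Counter

def frequent_substrings(seqs, min_len=4, min_count=3):
    """Count every substring of length >= min_len and keep the frequent ones."""
    counts = Counter()
    for s in seqs:
        for start in range(len(s) - min_len + 1):
            for end in range(start + min_len, len(s) + 1):
                counts[s[start:end]] += 1
    return {sub: n for sub, n in counts.items() if n >= min_count}

def maximal(freq):
    """Drop strings that sit inside a longer frequent string with the same count."""
    return {s: n for s, n in freq.items()
            if not any(s != t and s in t and freq[t] == n for t in freq)}

# Three of the sequences from the slide.
seqs = [
    "jackisprobablyimmunetocancer",
    "johnisprobablysusceptibletocancer",
    "jackisprobablyimmunetoplague",
]
print(maximal(frequent_substrings(seqs)))
```

Brute-force enumeration is quadratic in sequence length, which is fine for a toy example and is exactly the cost that dedicated pattern discovery tools are designed to avoid.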
SLIDE 7/25
Teiresias Overview
• Finds patterns in primitive streams – L/W/K patterns
  • L = minimum number of primitives in pattern
  • L/W = minimum density ( % non-wildcards )
  • K = number of times a pattern occurs
Example Output: 6/15/2 patterns
AFGLYEPC......LHQ.G.ET[ST]NSL.....A....SLKII.KA
LFPCFY
Annotations: "." = wildcard; LFPCFY has density = 6/6; density = 9/19
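A small sketch of the density idea follows. It assumes the sliding-window reading of L/W used in the published Teiresias description (every run of L consecutive non-wildcard positions spans at most W positions); the slide only says "minimum density", so treat that reading, and the helper names, as assumptions.

```python
import re

def tokens(pattern):
    """Split a pattern into positions; a bracket such as [ST] is one position."""
    return re.findall(r"\[[^\]]+\]|.", pattern)

def density(pattern):
    """Return (non-wildcard positions, total positions)."""
    toks = tokens(pattern)
    return sum(t != "." for t in toks), len(toks)

def satisfies_lw(pattern, L, W):
    """True if every window of L consecutive literals spans at most W positions."""
    pos = [i for i, t in enumerate(tokens(pattern)) if t != "."]
    if len(pos) < L:
        return False
    return all(pos[i + L - 1] - pos[i] + 1 <= W
               for i in range(len(pos) - L + 1))

print(density("LFPCFY"))                      # literals and length
print(satisfies_lw("MSK.I..LPGD..GPE", 5, 8))
```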
SLIDE 8/25
Teiresias Example
• Finding protein motifs
>protein 0
MSKNIVLLPGDHVGPEVVAEAVKVLEAVSSAIGVKFNFSKHLIGGASIDAYGVPLSDEALEAAKK
>protein 1
MSKQILVLPGDGIGPEIMAEAVKVLELANDRFQLGFELAEDVIGGAAIDKHGVP
>protein 2
MKFLILLFNILCLFPVLAADNHGVGPQGASGVDPITFDINSNQTGPAFLT
All patterns with at least 5 characters, density 5/8, and support 2
TEIRESIAS 5/8/2
pattern             location
GPE..AEAVKVLE       (0,13) (1,13)
IGGA.ID..GVP        (0,42) (1,42)
MSK.I..LPGD..GPE    (0,00) (1,00)
A.D.HGV             (1,46) (2,17)
Take away point: Given sequences, Teiresias finds possibly important patterns in them.
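The reported locations can be checked directly, reading each pair as (sequence index, zero-based offset). The sketch below does not run Teiresias; it only re-scans the three sequences from the slide with each reported pattern treated as a regular expression. The function name locations is an illustrative assumption.

```python
import re

# The three sequences shown on the slide.
proteins = [
    "MSKNIVLLPGDHVGPEVVAEAVKVLEAVSSAIGVKFNFSKHLIGGASIDAYGVPLSDEALEAAKK",
    "MSKQILVLPGDGIGPEIMAEAVKVLELANDRFQLGFELAEDVIGGAAIDKHGVP",
    "MKFLILLFNILCLFPVLAADNHGVGPQGASGVDPITFDINSNQTGPAFLT",
]

def locations(pattern):
    """List (sequence index, offset) for every place the pattern matches."""
    hits = []
    for idx, seq in enumerate(proteins):
        # Lookahead reports overlapping matches; "." is a one-residue wildcard.
        hits += [(idx, m.start()) for m in re.finditer(f"(?={pattern})", seq)]
    return hits

for p in ["GPE..AEAVKVLE", "IGGA.ID..GVP", "MSK.I..LPGD..GPE", "A.D.HGV"]:
    print(p, locations(p))
```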
SLIDE 9/25
Part II: Proposed Problems
SLIDE 10/25
Biological Sequences
• Motivation
  – Protein and DNA sequences
  – Lots of data
    • GenBank > 10^7 sequences, 10^10 nt
    • Swiss-Prot/TrEMBL nrdb 600,000 proteins
  – Natural language metaphor
• Many interesting problems
  – sequence-structure, molecular evolution, splicing, gene-finding, alignment
SLIDE 11/25
Proposed Problems
• Amino acid scoring matrix design
  – Model protein evolution using conserved motifs in protein databases.
  – Use this model of evolution to design scoring matrices for homology detection and sequence alignment.
• Oligonucleotide probe design
  – Predict hybridization kinetics from pattern-based homology
  – Use these predictions to choose optimal oligonucleotide probes for DNA microarrays
SLIDE 12/25
Expression and Physiology
• Motivation
  – Creating associations: simple observations of complex biological systems
  – Indicators for further research
• Association Discovery
  – Event streams are all the same length
  – Patterns cannot be shifted
  – Multiple associations possible, unlike clustering
  – Sensitive to local similarity and global
SLIDE 13/25
Association Discovery Example
• Heart disease clinical data
  – Cleveland study of 500 patients
Column labels: age, sex, blood pres., pain type, choles., blood sug., ekg, exercise, ekg depress., fluoroscopy +’s, ekg anomaly, #>50% clogged
63 1 145 233 1 2 150 0 3 0 6 0
67 1 160 286 0 2 108 1 2 3 3 2
67 1 120 229 0 2 129 1 2 2 7 1
37 1 130 250 0 0 187 0 3 0 3 0
41 0 130 204 0 2 172 0 1 0 3 0
Find conserved motifs in the rows
Patients with type 2 EKG anomaly, with positive fluoroscopy results and high blood pressure are likely to have more than one critically clogged artery.
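One simple way to read "conserved motifs in the rows" is: columns are fixed, values must agree exactly, and a motif is a set of column values shared by at least K rows. The sketch below derives candidate motifs from pairs of rows. It is an assumption-laden illustration on the five rows shown, not the method from the talk; a real analysis would first bin continuous values such as blood pressure.

```python
from itertools import combinations

# The five patient rows from the slide; columns are addressed by index only.
rows = [
    (63, 1, 145, 233, 1, 2, 150, 0, 3, 0, 6, 0),
    (67, 1, 160, 286, 0, 2, 108, 1, 2, 3, 3, 2),
    (67, 1, 120, 229, 0, 2, 129, 1, 2, 2, 7, 1),
    (37, 1, 130, 250, 0, 0, 187, 0, 3, 0, 3, 0),
    (41, 0, 130, 204, 0, 2, 172, 0, 1, 0, 3, 0),
]

def shared(a, b):
    """Keep values where two rows agree; None acts as a wildcard."""
    return tuple(x if x == y else None for x, y in zip(a, b))

def support(pattern, rows):
    """Number of rows that match every fixed position of the pattern."""
    return sum(all(p is None or p == v for p, v in zip(pattern, row))
               for row in rows)

def associations(rows, min_fixed=3, min_support=2):
    """Patterns with enough fixed columns and enough supporting rows."""
    found = {}
    for a, b in combinations(rows, 2):
        pat = shared(a, b)
        if sum(p is not None for p in pat) >= min_fixed:
            found[pat] = support(pat, rows)
    return {p: k for p, k in found.items() if k >= min_support}

for pat, k in associations(rows).items():
    print(k, pat)
```

Because positions are never shifted, a pattern here is just a partially filled row, which is what distinguishes association discovery from motif finding in sequences.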
SLIDE 14/25
Proposed Problems
• Linking expression and phenotype
  – Association discovery
gene expression (samples A–D by genes 1–4):
     1   2   3   4
A    2  -1   3  -2
B   -2   5   5   1
C   -1   4   3   2
D    9   7  -2   0
physiological (samples A–D by measurements 1–4):
     1   2   3   4
A   23   8   9  14
B   54   7  16  45
C   65  45  26   5
D   15  10  16   1
Example associations: “Genes 1 and 4 are associated with pathway ” or “Up-regulation of genes {4,6,10,…} gives rise to phenotype ”
How does the genome relate to the “physiome”?
Are there any recurring motifs?
…biological significance?
SLIDE 15/25
Part III: Work To Date
SLIDE 16/25
Motivation
• The sequence alignment problem
  – Given a protein sequence, find similar proteins in a database.
[Figure: a query sequence and a database are compared using a scoring matrix to produce sequence alignments]
But what do we mean by similar?
SLIDE 17/25
Scoring Matrix Basics
• Describe how we should align proteins
  – Matrix specifies a score for aligning each pair of amino acids
alignment:
RKISWMEIYTGEKSTKVYGQDVWLPAETLDLIREYRVAIKGPLTTPVGGGIRSLNVALRQ
::: :.:.: :::.:. : .. ::: :::....::.:.::::::::::::  :::::.::
RKIEWLEVYAGEKATQMYDSETWLPEETLNILQEYKVSIKGPLTTPVGGGMSSLNVAIRQ
Highest score is the “best” alignment.
[Figure: scoring matrix with columns A R N D C M E G H I L K Q and rows A R N K C Q E; the score for the K-Q alignment is highlighted]
For detecting homology the matrix should capture evolutionary processes.
…but how do we describe evolution?
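Scoring an ungapped alignment is a sum of per-pair lookups in the matrix. The sketch below shows that mechanic only: the values in TOY, the match score and the mismatch score are hypothetical numbers chosen for illustration and are not taken from BLOSUM, PAM, or the talk.

```python
# Hypothetical scores for two conservative substitutions (illustrative only).
TOY = {frozenset("KQ"): 1, frozenset("IV"): 3}

def pair_score(x, y, match=5, mismatch=-1):
    """Score for aligning residue x with residue y (symmetric lookup)."""
    if x == y:
        return match
    return TOY.get(frozenset((x, y)), mismatch)

def alignment_score(a, b):
    """Sum the pair scores over an ungapped alignment of equal-length strings."""
    if len(a) != len(b):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(x, y) for x, y in zip(a, b))

print(alignment_score("KIA", "QVA"))  # K-Q, I-V, A-A
```

Using frozenset keys makes the lookup order-independent, mirroring the symmetry of a substitution matrix.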
SLIDE 18/25
Protein Evolution
• A simple model of evolution
ancestral protein:
ILHLVGPNGAGKSTLLARMA
descendant proteins:
IVTLIGANGAGKSTLLMTLC
MAFLTGHSGAGKSTLLKLIC
VVVIIGPSGSGKSTLVRCIN
NIMVVGPSGSGKSTLLRCIN
VTAFIGPSGCGKTTLLRTFN
NIMVVGQSGLGKSTLINTLF
not functional:
MAFLTGHSGAGKSPLLKLIC
VVVIIGPSVSGKSTLVRCIN
active site: G..G.GK.TL
The distribution of amino acids in the changing positions describes the evolutionary process…
…use syntactic pattern discovery to find these conserved motifs.
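The "changing positions" can be read off by turning each wildcard in the active-site motif into a capture group. This is a minimal sketch using the sequences on the slide; the helper name wildcard_residues is an assumption, and treating the motif as a regular expression is just one convenient way to do it.

```python
import re
from collections import Counter

MOTIF = "G..G.GK.TL"
# Each "." becomes a capture group so the residue filling it can be read back.
CAPTURE = re.compile(MOTIF.replace(".", "(.)"))

def wildcard_residues(seq):
    """Residues at the motif's wildcard positions, or None if the motif is absent."""
    m = CAPTURE.search(seq)
    return "".join(m.groups()) if m else None

descendants = [
    "IVTLIGANGAGKSTLLMTLC", "MAFLTGHSGAGKSTLLKLIC", "VVVIIGPSGSGKSTLVRCIN",
    "NIMVVGPSGSGKSTLLRCIN", "VTAFIGPSGCGKTTLLRTFN", "NIMVVGQSGLGKSTLINTLF",
]

# Distribution of amino acids at each changing position.
for i, column in enumerate(zip(*(wildcard_residues(s) for s in descendants))):
    print(i, Counter(column))

# The two variants labelled "not functional" no longer contain the motif.
print(wildcard_residues("MAFLTGHSGAGKSPLLKLIC"))
```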
SLIDE 19/25
Discovering Patterns
• Example: four ATP-associated proteins
>sp|Q07698|ABCA_AERSA ABC transporter protein
MSEPVLAVSGVNKSFPIYRSPWQALWHALNPKADVKVFQALRDIELTVYRGETIGIVGHNGAGKSTLLQLITGVMQPDCGQITRTGRVVGLLELGSGFNPEFTGRENIFFNGAILGMSQREMDDRLERILSFAAIGDFIDQPVKNYSSGMMVRLAFSVIINTDPDVLIIDEALAVGDDAFQRKCYARLKQLQSQGVTILLVSHAAGSVIELCDRAVLLDRGEVLLQGEPKAVVHNYHKLLHMEGDERARFRYHLRQTGRGDSYISDESTSEPKIKSAPGILSVDLQPQSTVWYESKGAVLSDVHIESF
>sp|Q02856|ABCX_ANTSP Probable ATP-dependent transporter
MNNRILLNIKNLDVTIGETQILNSLNLSIKPGEIHAIMGKNGSGKSTLAKVIAGHPSYKI
TNGQILFENQDVTEIEPEDRSHLGIFLAFQYPVEIPGVTNADFLRIAYNAKRAFDNKEEL
DPLSFFSFIENKISNIDLNSTFLSRNVNEGFSGGEKKKNEILQMSLLNSKLAILDETDSG
LDIDALKTIAKQINSLKTQENSIILITHYQRLLDYIKPDYIHVMQKGEIIYTGGSDTAMKLEKYGYDYLNK
>sp|P07655|PSTB_ECOLI ATP-BINDING PROTEIN PSTB
MSMVETAPSKIQVRNLNFYYGKFHALKNINLDIAKNQVTAFIGPSGCGKSTLLRTFNKMFELYPEQRAEGEILLDGDNILTNSQDIALLRAKVGMVFQKPTPFPMSIYDNIAFGVRLFEKLSRADMDERVQWALTKAALWNETKDKLHQSGYSLSGGQQQRLCIARGIAIRPEVLLLDEPCSALDPISTGRIEELITELKQDYTVVIVTHNMQQAARCSDHTAFMYLGELIEFSNTDDLFTKPAKKQTEDYITGRYG
>sp|P10346|GLNQ_ECOLI ATP-BINDING PROTEIN GLNQ
GPTQVLHNIDLNIAQGEVVVIIGPSGSGKSTLLRCINKLEEITSGDLIVDGLKVNDPKVDERLIRQEAGMVFQQFYLFPHLTALENVMFGPLRVRGANKEEAKLARELLAKVGLAERAHHYPSELSGGQQQRVAIARALAVKPKMMLFDEPTSALDPELRHEVLKVMQDLAEEGMTMVIVTHEIGFAEKVASRLIFIDKGRIAEDGNPQVLIKNPPSQRLQEFLQHVS
Given a database, we can use Teiresias to find the conserved motifs…
ATP binding signature
ATP binding motif G..G.GK[ST]TL was “discovered” in 2500 sequences in SWISS-PROT/TrEMBL.
…how do we construct the scoring matrix?
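Once a motif such as G..G.GK[ST]TL is known, scanning a FASTA file for it is straightforward. The sketch below uses 20-residue excerpts cut from the four proteins on the slide (the word "excerpt" in each header and the parse_fasta helper are additions for this example, not database content).

```python
import re

# Short excerpts around the ATP-binding site of each protein on the slide.
FASTA = """\
>sp|Q07698|ABCA_AERSA excerpt
IGIVGHNGAGKSTLLQLITG
>sp|Q02856|ABCX_ANTSP excerpt
IHAIMGKNGSGKSTLAKVIA
>sp|P07655|PSTB_ECOLI excerpt
VTAFIGPSGCGKSTLLRTFN
>sp|P10346|GLNQ_ECOLI excerpt
VVVIIGPSGSGKSTLLRCIN
"""

def parse_fasta(text):
    """Map entry name (last '|' field of the header ID) to its sequence."""
    records, name = {}, None
    for line in text.splitlines():
        if line.startswith(">"):
            name = line[1:].split()[0].split("|")[-1]
            records[name] = ""
        elif name and line.strip():
            records[name] += line.strip()
    return records

MOTIF = re.compile(r"G..G.GK[ST]TL")

offsets = {}
for name, seq in parse_fasta(FASTA).items():
    hit = MOTIF.search(seq)
    if hit:
        offsets[name] = hit.start()
print(offsets)
```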
SLIDE 20/25
Patterns to Matrix
• Counting pairs of amino acids
Example Pattern: L..F.L..CI...L
Seq A • IINSSLWWIIKGPILISILVNFILFICIIRILVQKLRPPDIG
Seq B • LTLITRVGLALSLFCLLLCILTFLLVRPIQGSRTTIHLHLCICLFVG
Seq C • IKTPILVSILRNFILFICIIRILVQKLHSPDVGHNE
How many AA pairs are there at each position?
pairs: 1 – VS, 1 – VR, 1 – SR
pairs: 1 – FF, 2 – LF
Count AA pairs for all patterns and construct a table of pair counts.
[Figure: AA pair frequency table with columns A R N D C M E G H I L K Q and rows A R N K C Q E]
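The pair counts on this slide can be reproduced from the three sequences. The sketch below locates the pattern instance in each sequence, then counts unordered residue pairs column by column at the wildcard positions. Function names are illustrative, and the column indexing assumes a pattern without bracket expressions, as in this example.

```python
import re
from collections import Counter
from itertools import combinations

PATTERN = "L..F.L..CI...L"
seqs = [
    "IINSSLWWIIKGPILISILVNFILFICIIRILVQKLRPPDIG",        # Seq A
    "LTLITRVGLALSLFCLLLCILTFLLVRPIQGSRTTIHLHLCICLFVG",   # Seq B
    "IKTPILVSILRNFILFICIIRILVQKLHSPDVGHNE",              # Seq C
]

def instances(pattern, seqs):
    """The first substring matching the pattern in each sequence that has one."""
    rx = re.compile(pattern)
    found = (rx.search(s) for s in seqs)
    return [m.group(0) for m in found if m]

def pair_counts(pattern, seqs):
    """Unordered residue-pair counts for each wildcard column of the pattern."""
    inst = instances(pattern, seqs)
    per_column = {}
    for col, sym in enumerate(pattern):
        if sym != ".":
            continue
        residues = [x[col] for x in inst]
        per_column[col] = Counter(tuple(sorted(p))
                                  for p in combinations(residues, 2))
    return per_column

counts = pair_counts(PATTERN, seqs)
print(counts[1])  # first wildcard column: V, S, R
print(counts[6])  # a later wildcard column: F, L, F
```

Summing these per-column counters over all patterns gives the pair frequency table described on the slide.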
SLIDE 21/25
Patterns to Matrix
• Make a log-of-odds matrix
odds that an AA pair does not occur by chance = (probability of seeing AA pair in our patterns) / (probability of seeing AA pair by chance)
[Figure: AA pair frequency table → MATH → AA log-of-odds scoring matrix]
positive values mean these pairs are more prevalent in our patterns than by chance…
…and negative values are less prevalent
Take away point: The evolutionary information contained in the patterns is stored in terms of the scoring matrix.
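The "MATH" step can be sketched as a log-odds computation. The slide does not say how the chance probability is estimated, so this example assumes the BLOSUM-style convention: residue background frequencies are derived from the pair table itself, and the expected frequency of a mixed pair is 2·p(a)·p(b). The toy counts are made up for illustration.

```python
import math
from collections import Counter

def log_odds(pair_counts):
    """log2(observed / expected) for each unordered pair, BLOSUM-style."""
    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue, implied by the pair table.
    background = Counter()
    for (a, b), f in observed.items():
        background[a] += f / 2
        background[b] += f / 2

    scores = {}
    for (a, b), f in observed.items():
        expected = background[a] * background[b]
        if a != b:
            expected *= 2  # an unordered mixed pair can arise in two orders
        scores[(a, b)] = math.log2(f / expected)
    return scores

# Hypothetical counts over a two-letter alphabet.
toy_counts = {("A", "A"): 6, ("A", "B"): 2, ("B", "B"): 2}
for pair, s in log_odds(toy_counts).items():
    print(pair, round(s, 2))
```

Positive scores mark pairs seen more often in the patterns than the background predicts, and negative scores mark pairs seen less often, matching the reading given on the slide.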
SLIDE 22/25
Basic Idea
[Figure: database → TEIRESIAS → patterns → MATRIX ENGINE → scoring matrix]
Take away point: Given a set of sequences, we use Teiresias to discover important patterns and construct a scoring matrix which captures the way these patterns are evolving.
BDSUM: Bio-Dictionary AA Substitution Matrices
SLIDE 23/25
Example Results
• Isocitrate dehydrogenase family
  – 100 sequences from Prosite PS00470
Experiment: Using each sequence from the family, try to detect the other 99 sequences in the Swiss-Prot/TrEMBL database.
Results:
comparison                                win  loss  tie
BDSUM(PS00470) vs. BLOSUM62(PS00470)      100     0    0
BDSUM(PS00470) vs. BLOSUM62(Prosite)       30    17   53
BDSUM(PS00470) vs. BLOSUM50(Prosite)       47     9   44
SLIDE 24/25
Current Work
• Applying to Bio-Dictionary
  – full SWISS-PROT/TrEMBL
• “Tweaking”
  – Which pattern classes are evolutionarily meaningful?
  – Different “PAM-distance” matrices
• More testing
…and the oligo probes…
SLIDE 25/25
Acknowledgements
• Dr. Isidore Rigoutsos
• Prof. Greg Stephanopoulos
Group members: Mike, Maciek, Bill, Daehee, Jatin, Vipin, Maria, Javier, Maria, Matt, Gary, Saliya, Juan, Angelo, Chris, Dan, Giovanna, Joanne, Hyun-Tae, Patrick, Kyongbum…