kyle jensen mit ph.d. thesis defense
TRANSCRIPT
Motif discovery in sequential data
Kyle Jensen
Thesis Offense
Department of Chemical Engineering
Massachusetts Institute of Technology
Thesis committee:
Greg Stephanopoulos (ChE, MIT)
William Green (ChE, MIT)
Robert Berwick (EECS, MIT)
Isidore Rigoutsos (IBM)
Sequencing throughput, like processor power, is growing exponentially
As a result, Genbank is overflowing
Anatomics, Biomics, Chromosomics, Cytomics, Enviromics, Epigenomics, Fluxomics, Glycomics, Glycoproteomics, Immunogen., Immunomics, Immunoproteomics, Integromics, Interactomics, Ionomics, Lipidomics, Metabolomics, Metabonomics, Metagenomics, Metallomics, Metalloproteomics, Methylomics, Mitogenomics, Neuromics, Neuropeptido., Oncogenomics, Peptidomics, Phenomics, Phospho-prot., Phosphoproteomics, Physiomics, Physionomics, Postgenomics, Pregenomics, Rnomics, Secretomics, Subproteomics, Surfaceomics, Syndromics, Transcriptomics
And the ome-ome keeps growing
Together, these data form a rich network of information
CTTCATCAATTATCGTACTCTTGTTAATGTGGTAAAATATAAACTGGACCACATGAGAAGAAGAATTGAGACCGATGAGAGAGATTCGACCAACCGGGCTTCCTTCAAATGTCCTGTCTGTAGTAGTACTTTCACAGACTTAGAAGCTAATCAGCTCTTTGATCCTATGACAGGAACTTTCCGCTGTACTTTTTGCCATACAGAGGTAGAAGAGGATGAATCAGCAATGCCCAAAAAAGATGCACGCACACTTTTGGCAAGGTTTAATGAACAAATTGAGCCCATTTATGCATTGCTTCGGGAGACAGAGGATGTGAACTTGGCCTATGAAATACTTGAGCCAGAACCCACAGAAATCCCAGCCCTGAAACAGAGCAAGGACCATGCAGCAACTACTGCTGGAGCTGCTAGCCTAGCAGGTGGGCACCACCGGGAAGCATGGGCCACCAAAGGTCCTTCCTATGAAGACTTATACACTCAGAATGTTGTCATTAACATGGATGACCAAGAAGATCTTCATCGAGCCTCACTGGAAGGGAAATCTGCCAAAGAGAGGCCTATTTGGTTGAGAGAAAGCACTGTCCAAGGGGCATATGGTTCTGAAGATATGAAAGAAGGGGGCATAGATATGGACGCATTTCAGGAGCGTGAGGAAGGCCATGCTGGGCCACCAAAGGTCCTTCCTATGAAGACTTATACACTCAGAATGTTGTCATTAACATGGATGACCAAGAAGATCTTCATCGAGCCTCACTGGAAGGGAAATCTGCCAAAGAGAGGCCTATTTGGTTGAGAGAAAGCACTGTCCAAGGGGCATATGGTTCTGAAGATATGAAAGAAGGGGGCATAGATATGGACGCATTTCAGGAGCGTGAGGAAGGCCATGCTGAGAAGGGGGCATAGATATGGACGCATTTCAGGAGC
This data glut motivates the need for automated methods of discovery and analysis
Here, I focus on motif discovery in sequential data using a linguistic metaphor
S → NP VP
NP → D NP1 | PN
NP1 → ADJ NP1 | N
VP → V NP
D → a | the
PN → peter | paul | mary
ADJ → large | black
N → dog | cat | horse
V → is | likes | hates
A grammar is a mathematical system for describing the structure of a language
S => NP VP => PN VP => mary VP => mary V NP => mary hates NP => mary hates D NP1 => mary hates the NP1 => mary hates the N => mary hates the dog
S => NP VP => NP V NP => NP V D NP1 => NP V a NP1 => NP V a ADJ NP1 => NP is a ADJ NP1 => NP is a ADJ ADJ NP1 => NP is a large ADJ NP1 => NP is a large ADJ N => NP is a large black N => NP is a large black cat => PN is a large black cat => peter is a large black cat
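The two derivations above can be checked mechanically. The following is a minimal sketch, not anything shown in the talk: it encodes the toy grammar as a Python dictionary and tests whether a sentence can be derived from S. The rule names follow the derivations (NP and NP1); the function names `ends` and `is_grammatical` are made up for this illustration.

```python
# Toy grammar from the slide, written as symbol -> list of alternative productions.
# The NP / NP1 split is read off the derivations; treat it as an assumption.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["D", "NP1"], ["PN"]],
    "NP1": [["ADJ", "NP1"], ["N"]],
    "VP": [["V", "NP"]],
    "D": [["a"], ["the"]],
    "PN": [["peter"], ["paul"], ["mary"]],
    "ADJ": [["large"], ["black"]],
    "N": [["dog"], ["cat"], ["horse"]],
    "V": [["is"], ["likes"], ["hates"]],
}


def ends(symbol, tokens, start):
    """Return every position where `symbol` can end if it begins at `start`."""
    if symbol not in GRAMMAR:  # a terminal word
        if start < len(tokens) and tokens[start] == symbol:
            return {start + 1}
        return set()
    result = set()
    for production in GRAMMAR[symbol]:
        positions = {start}
        for part in production:
            # Advance every partial match through the next symbol of the production.
            positions = {e for p in positions for e in ends(part, tokens, p)}
        result |= positions
    return result


def is_grammatical(sentence):
    """True if the whole sentence can be derived from the start symbol S."""
    tokens = sentence.split()
    return len(tokens) in ends("S", tokens, 0)


print(is_grammatical("mary hates the dog"))
print(is_grammatical("peter is a large black cat"))
print(is_grammatical("dog the hates mary"))
```

The recursion terminates because the grammar has no left recursion: every recursive rule consumes at least one word first.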
Grammars can describe biological phenomena in the same manner as natural languages
Two examples
Example: a declarative sentence in English
Example: eukaryotic gene structure
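[Figure: parse tree of a declarative English sentence, S → NP V A P NP with NP → D N. Example sentences: "the boy is upset over the girl" and "the advisor is pleased with the research".]
[Figure: eukaryotic gene structure drawn the same way. Labels: gene, upstream, TATA box, primary transcript, start codon, exon, intron, exon, stop codon.]
ATGACTGACTGATCGATCGATCGATCGATGATCGTACGATCGATGCATCGATCGATCGATCGATCGA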
Grammars are suitable for describing any complex arrangement of sequential data
The grammar of biological sequences
[Table: language, grammar, linguistic example, biological example, ordered by complexity]
Simple, regular grammars are compactly written as regular expressions
[LIVF].........[LIV][RK].(9,20)WS.WS....[FYW]
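As a rough illustration of how such a motif can be used, the sketch below rewrites the slide's pattern into Python's `re` syntax and matches it against a sequence. Reading `.(9,20)` as "between 9 and 20 wildcard positions" is an assumption, and the example sequence is hand-built to fit the motif; it is not a real protein.

```python
import re

# Motif copied from the slide; only the "(m,n)" repeat syntax is rewritten below.
SLIDE_MOTIF = "[LIVF].........[LIV][RK].(9,20)WS.WS....[FYW]"


def to_python_regex(motif):
    """Rewrite '(m,n)' repeat counts as '{m,n}' (assumed meaning: m to n wildcards)."""
    return re.sub(r"\((\d+),(\d+)\)", r"{\1,\2}", motif)


PATTERN = re.compile(to_python_regex(SLIDE_MOTIF))

# Hypothetical sequence constructed to satisfy the motif position by position.
example = "L" + "A" * 9 + "LR" + "A" * 10 + "WSAWS" + "AAAA" + "F"

print(to_python_regex(SLIDE_MOTIF))
print(bool(PATTERN.fullmatch(example)))
```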
Motif discovery is the inverse problem: given the sentences, find the grammar
CTTCATCAATTATCGTACTCTTGTTAATGTGGTAAAATATAAACTGGACCACATGAGAAGAAGAATTGAGACCGATGAGAGAGATTCGACCAACCGGGCTTCCTTCAAATGTCCTGTCTGTAGTAGTACTTTCACAGACTTAGAAGCTAATCAGCTCTTTGATCCTATGACAGGAACTTTCCGCTGTACTTTTTGCCATACAGAGGTAGAAGAGGATGAATCAGCAATGCCCAAAAAAGATGCACGCACACTTTTGGCAAGGTTTAATGAACAAATTGAGCCCATTTATGCATTGCTTCGGGAGACAGAGGATGTGAACTTGGCCTATGAAATACTTGAGCCAGAACCCACAGAAATCCCAGCCCTGAAACAGAGCAAGGACCATGCAGCAACTACTGCTGGAGCTGCTAGCCTAGCAGGTGGGCACCACCGGGAAGCATGGGCCACCAAAGGTCCTTCCTATGAAGACTTATACACTCAGAATGTTGTCATTAACATGGATGACCAAGAAGATCTTCATCGAGCCTCACTGGAAGGGAAATCTGCCAAAGAGAGGCCTATTTGGTTGAGAGAAAGCACTGTCCAAGGGGCATATGGTTCTGAAGATATGAAAGAAGGGGGCATAGATATGGACGCATTTCAGGAGCGTGAGGAAGGCCATGCTGGGCCACCAAAGGTCCTTCCTATGAAGACTTATACACTCAGAATGTTGTCATTAACATGGATGACCAAGAAGATCTTCATCGAGCCTCACTGGAAGGGAAATCTGCCAAAGAGAGGCCTATTTGGTTGAGAGAAAGCACTGTCCAAGGGGCATATGGTTCTGAAGATATGAAAGAAGGGGGCATAGATATGGACGCATTTCAGGAGCGTGAGGAAGGCCATGCTGAGAAGGGGGCATAGATATGGACGCATTTCAGGAGC
Part 1: Rational design of antimicrobial peptides using linguistic methods
Antimicrobial peptides are small proteins that attack and kill bacteria
Functional characteristics:
  Part of innate immune system
    all multicellular eukaryotes
  Attack bacterial membrane
    electrostatic attraction
    effective at µg/mL concentrations
Applications of AmPs:
  Novel class of antibiotics
    low bacterial resistance
    activity against MDR pathogens
    currently topical: acne, etc.
  Other clinical applications
    AIDS, certain cancers, biodefense
[Figure: AmPs and the bacterial membrane, with + and - charge labels]
AmP sequences contain many repeated motifs, suggesting a linguistic model
AmP amino acid sequences
  ~1000 natural AmP sequences from many different species
  Numerous conserved motifs suggest rules for building AmPs
  similar to grammar of languages
[Figure labels: cecropins, cecropin motif]
The language of AmP sequences
  Can we find the underlying grammar of this language?
  Will this grammar capture the sequence/function relationships?
  Knowing the grammar, can we build novel AmPs?
The AmP sequences were modeled using simple regular grammars
Given a language, is there a regular grammar?
Example: the cecropin sub-sequences
  seq1: QSEAGWLKKLGK
  seq2: QSEAGWLRKAAK
  seq3: QTEAGGLKKFGK
What grammar describes these sequences?
  cecropin motif: Q.EAG.L.K..K
Automated grammar induction: Teiresias
Regular grammars of the form G = (V, Σ, R, S), where
  V = non-terminal symbols
  Σ = amino acids
  R = set of replacement rules
  S = starting amino acid
  R: Vi → σ Vj, where σ ∈ Σ (type A, aa) or σ = {.} (type B, wildcard)
Find all G for which a/b > w and a + b > L
Subject to maximal |R| and maximal occurrences of G
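To make the cecropin example concrete, here is a small sketch that recovers the motif Q.EAG.L.K..K from the three aligned sub-sequences by keeping a residue wherever all sequences agree and writing a wildcard elsewhere. This column-wise consensus is only an illustration on pre-aligned, equal-length strings. It is not the Teiresias algorithm, which searches unaligned sequences, and `consensus_motif` is a name invented here.

```python
import re

# The three cecropin sub-sequences from the slide (already aligned, equal length).
SEQUENCES = ["QSEAGWLKKLGK", "QSEAGWLRKAAK", "QTEAGGLKKFGK"]


def consensus_motif(aligned):
    """Keep a residue where every sequence agrees; emit the wildcard '.' otherwise."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        column[0] if len(set(column)) == 1 else "."
        for column in zip(*aligned)
    )


motif = consensus_motif(SEQUENCES)
print(motif)

# The motif is a valid regular expression, so it can be matched back directly.
print(all(re.fullmatch(motif, s) for s in SEQUENCES))
```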
Our goal was to use this linguistic model to design novel AmPs
Protein design space is combinatorially large
  20^N possible sequences of N amino acids
  N = 18, number of stars in universe
  N = 50, number of atoms in Earth
  N = 100, number of electrons in universe
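The size of the design space is simple to compute exactly. The sketch below only evaluates 20^N for the three values of N on the slide and reports the power of ten; it makes no attempt to verify the astronomical comparisons. The helper names are made up for this illustration.

```python
def design_space_size(n, alphabet=20):
    """Number of distinct sequences of length n over an alphabet of 20 amino acids."""
    return alphabet ** n


def order_of_magnitude(n):
    """Largest power of ten not exceeding 20**n, computed with exact integers."""
    return len(str(design_space_size(n))) - 1


for n in (18, 50, 100):
    print(f"N = {n}: about 10^{order_of_magnitude(n)} sequences")
```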
Why design novel AmPs?
  Concern over RamPs
  Cross-resistance
Other approaches
  Folding & thermodynamics
  Combinatorial libraries
[Figure labels: sequence space, grammatical space, natural AmPs, true AmPs]
We used Teiresias to discover ~700 grammars defining the language of AmPs
[Figure: a query sequence matched against grammar 1 and grammar 2]
These grammars were used to design novel AmPs
  No more than 5 amino acids in a row shared with natural AmPs
  12 million grammatical sequences
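One way to read the "5-in-a-row" rule is that a designed peptide may not share any run of 6 or more consecutive residues with a natural AmP. The sketch below implements that reading as a filter. The interpretation, the function names, and the two candidate peptides are assumptions made for illustration. The "natural" set here is just the three cecropin sub-sequences shown earlier in the talk.

```python
def shares_run(candidate, natural_sequences, run_length):
    """True if `candidate` shares any substring of `run_length` residues with a natural sequence."""
    natural_runs = {
        seq[i:i + run_length]
        for seq in natural_sequences
        for i in range(len(seq) - run_length + 1)
    }
    return any(
        candidate[i:i + run_length] in natural_runs
        for i in range(len(candidate) - run_length + 1)
    )


def is_novel(candidate, natural_sequences, max_shared=5):
    """Accept a design only if no run longer than `max_shared` residues is copied."""
    return not shares_run(candidate, natural_sequences, max_shared + 1)


NATURAL = ["QSEAGWLKKLGK", "QSEAGWLRKAAK", "QTEAGGLKKFGK"]

# Hypothetical designs that both fit the motif Q.EAG.L.K..K.
print(is_novel("QSEAGALKKAGK", NATURAL))  # shares at most 5 residues in a row
print(is_novel("QSEAGWLKKAGK", NATURAL))  # copies a longer run from seq1
```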
40 novel AmPs were chosen for experimental validation
Tested against B. subtilis & E. coli
serial dilutions
replicates
Expect activity?   Y                 N
Test               42 motif-based    42 shuffled
Control            9 natural AmPs    9 non-AmPs
Our results show significant enrichment for activity in the designed set
Expected activity?   Y                          N
Test                 42 motif-based: 18 / 42    42 shuffled: 2 / 42
Control              9 natural AmPs: 6 / 9      9 non-AmPs: 0 / 9
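The slides do not say which statistical test backs the word "significant". As one plausible check, the sketch below runs a one-sided Fisher exact test on the test-set counts (18 of 42 motif-based peptides active versus 2 of 42 shuffled peptides), using only the standard library. The choice of test and the function name are assumptions.

```python
from math import comb


def fisher_one_sided(active_a, n_a, active_b, n_b):
    """P-value for seeing at least `active_a` actives in group A under the
    hypergeometric null that activity is unrelated to group membership."""
    total = n_a + n_b
    total_active = active_a + active_b
    denominator = comb(total, n_a)
    upper = min(n_a, total_active)
    numerator = sum(
        comb(total_active, k) * comb(total - total_active, n_a - k)
        for k in range(active_a, upper + 1)
    )
    return numerator / denominator


# Counts from the results table: motif-based designs vs. shuffled controls.
p_value = fisher_one_sided(18, 42, 2, 42)
print(f"active fraction, designed: {18 / 42:.2f}")
print(f"active fraction, shuffled: {2 / 42:.2f}")
print(f"one-sided Fisher exact p-value: {p_value:.2e}")
```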
Optimized leads showed strong activity against anthrax and staph
Part 2: A generic motif discovery algorithm for diverse biomolecular data
Motif discovery is the automated search for similar regions in streams of data
Un-sequential data
  No ordering
Sequential data
  A natural ordering of the data
  Nucleotide and amino acid sequences
  Stock prices, protein structures
MLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFALKNNRKSVCAVHKANIMKLGDGLFRNTVNEIGANEYPELDVKNIIVDNASMQAVAKPHQFDVLVTPNLYGSILGNIGSALIGGPGLVPGANFGREYAVFEPGSRHVGLDIKGQNVANPTAMILSSTLMLRHLGLNAYADRISKATYDVISEGKSTTRDIGGSASMLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFA
A motif is just a collection of mutually similar regions in the data stream
There are two classes of motif discovery tools commonly used for sequence analysis
Exhaustive regular-expression based tools
  Teiresias
  Pratt
Descriptive position weight matrix-based tools
  Gibbs sampler
  MEME
  Consensus
TGCTGTATATACTCACAGCAAACTGTATATACACCCAGGGTACTGTATGAGCATACAGTAACCTGAATGAATATACAGTATACTGTACATCCATACAGTATACTGTATATTCATTCAGGTAACTGTTTTTTTATCCAGTAATCTGTATATATACCCAGCTTACTGTATATAAAAACAGTA
CT[AT].[GT]....A..CAG
Gemoda was designed to be exhaustive and have descriptive power
Gemoda exhaustively returns maximal motifs
  Uses convolution of Teiresias
  Way of stitching together smaller patterns combinatorially
Gets descriptiveness from similarity metric
  Generic, context-dependent definition of similarity
MLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFALKNNRKSVCAVHKANIMKLGDGLFRNTVNEIGANEYPELDVKNIIVDNASMQAVAKPHQFDVLVTPNLYGSILGNIGSALIGGPGLVPGANFGREYAVFEPGSRHVGLDIKGQNVANPTAMILSSTLMLRHLGLNAYADRISKATYDVISEGKSTTRDIGGSASMLRQGIAAQKKSFATLAAEQLLPKKYGGRYTVTLIPGDGVGKEVTDSVVKIFENENIPIDWETIDISGLENTENVQRAVESLKRNKVGLKGIWHTPADQTGHGSLNVALRKQLDIFANVALFKSIPGVKTRLNNIDMVIIRENTEGEYSGLEHESVPGVVESLKIMTRAKSERIARFAFDFA
F(w1, w2) = square error
F(w1, w2) = aa scoring matrix
Gemoda proceeds in three steps: comparison, clustering, and convolution
The comparison stage is used to map the pairwise similarities between all windows in the data streams
Creates a distance matrix
  Does an all-by-all comparison of windows in the data
Comparison function is context-specific
F(w1, w2)
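The comparison stage can be sketched as a double loop over every window in every sequence, with the comparison function F passed in as a parameter. This is an illustrative outline, not Gemoda's code: the function names, the use of Hamming distance for F, and the "distance at or below a threshold" rule are assumptions chosen to keep the example small.

```python
def hamming(w1, w2):
    """Example comparison function F(w1, w2): number of mismatched positions."""
    return sum(a != b for a, b in zip(w1, w2))


def compare(sequences, window_length, F, threshold):
    """All-by-all comparison of windows.

    Returns the list of window ids (sequence index, start position) and the set
    of id pairs whose distance F is at or below `threshold`.
    """
    ids = [
        (s, i)
        for s, seq in enumerate(sequences)
        for i in range(len(seq) - window_length + 1)
    ]
    window = {(s, i): sequences[s][i:i + window_length] for (s, i) in ids}
    similar = set()
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            if F(window[ids[a]], window[ids[b]]) <= threshold:
                similar.add((ids[a], ids[b]))
    return ids, similar


# Toy input: the only exactly shared 3-letter window is "GAC".
ids, similar = compare(["AAGACT", "TTGACA"], 3, hamming, threshold=0)
print(similar)
```

Swapping `hamming` for a squared-error or scoring-matrix function changes what "similar" means without touching the loop, which is the point of making F context-specific.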
The clustering phase is used to find groups of mutually similar windows
Different clustering functions have different uses
  Clique-finding is provably exhaustive
  K-means and other methods are faster
Output clusters become elementary motifs which are convolved to make longer, maximal motifs
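For the clique-finding option, a standard way to enumerate all maximal cliques of a similarity graph is the Bron-Kerbosch recursion. The talk does not name the algorithm Gemoda uses, so the choice here is an assumption. The sketch takes a graph as a dictionary from each window to the set of windows it is similar to.

```python
def maximal_cliques(adjacency):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion.

    `adjacency` maps each node to the set of its neighbours (no self-loops).
    """
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))  # nothing can extend this clique
            return
        for node in list(candidates):
            expand(
                current | {node},
                candidates & adjacency[node],
                excluded & adjacency[node],
            )
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return cliques


# Toy similarity graph: windows a, b, c are mutually similar; d is similar only to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(sorted(sorted(c) for c in maximal_cliques(graph)))
```

Each clique is a group of mutually similar windows, that is, one elementary motif.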
The convolution phase is used to stitch together the clusters into maximal motifs
The motifs should be as long as possible, without decreasing the support
[Figure labels: elementary motifs (clusters), window ordering]
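The convolution idea can be sketched as follows, under simplifying assumptions that are mine rather than the talk's: each cluster is a set of (sequence index, window position) pairs, a motif is extended by one window whenever the next windows of its occurrences all fall in a single cluster, and motifs fully contained in a longer motif with at least the same support are dropped. The names `convolve` and `_covers` are hypothetical.

```python
def _covers(big, small):
    """True if motif `small` lies inside motif `big` at a fixed offset."""
    (big_starts, big_len), (small_starts, small_len) = big, small
    if big == small or len(big_starts) < len(small_starts):
        return False
    return any(
        all((s, p - offset) in big_starts for (s, p) in small_starts)
        for offset in range(big_len - small_len + 1)
    )


def convolve(clusters, min_support):
    """Stitch clusters of adjacent windows into maximal motifs.

    A motif is (start windows, length in windows). Extensions that would drop
    the number of occurrences below `min_support` are rejected.
    """
    if min_support < 1:
        raise ValueError("min_support must be at least 1")
    clusters = [frozenset(c) for c in clusters]
    stack = [(c, 1) for c in clusters if len(c) >= min_support]
    seen = set(stack)
    candidates = set()
    while stack:
        starts, length = stack.pop()
        extended_without_loss = False
        for cluster in clusters:
            # Occurrences whose next window also belongs to this cluster.
            kept = frozenset((s, p) for (s, p) in starts if (s, p + length) in cluster)
            if len(kept) < min_support:
                continue
            if len(kept) == len(starts):
                extended_without_loss = True
            state = (kept, length + 1)
            if state not in seen:
                seen.add(state)
                stack.append(state)
        if not extended_without_loss:
            candidates.add((starts, length))
    return {m for m in candidates if not any(_covers(o, m) for o in candidates)}


# Three clusters of consecutive windows shared by sequences 0 and 1, plus one isolated cluster.
clusters = [
    {(0, 2), (1, 5)},
    {(0, 3), (1, 6)},
    {(0, 4), (1, 7)},
    {(0, 9), (1, 0)},
]
for starts, length in sorted(convolve(clusters, min_support=2), key=lambda m: -m[1]):
    print(sorted(starts), "spans", length, "windows")
```

With windows of w residues, a motif spanning k windows covers k + w - 1 residues.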
Here we show a few representative ways in which Gemoda can be used
Motif discovery in...
Protein sequences
  (ppGpp)ase enzymes & finding known domains
DNA sequences
  The LD-motif challenge problem
Protein structures
  Conserved structures without conserved sequences
Gemoda can be applied to amino acid sequences as well
Example: (ppGpp)ase family from ENZYME database
  Guanosine-3',5'-bis(diphosphate) 3'-pyrophosphohydrolase enzymes
  EC 3.1.7.2
  Ave. length ~700 amino acids
  8 sequences from 8 species
Searched using Gemoda
  Minimum length = 50 amino acids
  Minimum Blosum62 bit score = 50 bits
  Minimum support = 100% (8/8 sequences)
  Clustering method = clique finding
Can Gemoda find this known motif?
How sensitive is Gemoda to noise?
(ppGpp)ase example: the comparison phase shows many regions of local similarity
Dots indicate 50aa windows that are pairwise similar
Streaks indicate regions that will probably be convolved into a maximal motif
(ppGpp)ase example: the clustering phase shows elementary motifs conserved between all 8 enzyme sequences
(ppGpp)ase example: the final motifs match the known rela_spot domain and the HD domain from NCBI's conserved domain database
Maximal motif (one of three, ~100 aa in length)
This particular cluster represents the first set of 8 50aa windows in the above motif.
Results are insensitive to noise
The LD-motif problem models the subtle binding site discovery problem
GACTCGATAGCGACG
Sequence #1: ATGATGAGTCTATTGCGCCGCGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG...
Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCTCTCGATTGCGACTTTCGACTAGCTA...
Sequence #3: ATGTACTACGAGTCTCCATAGCGTTGCTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT...
Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGACGACTCGTGGGCGGCG...
...
Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACTGATTCGTAAGGGACGATAGCTACTATCTTATTCGACTAGTACGACT...
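The slide does not define the LD-motif problem. Assuming the usual reading (a motif of length L is planted in each sequence with up to D point mutations), the core operation is finding windows within a given Hamming distance of the motif. The sketch below shows that operation on a hand-made sequence. The background flanks, the single mutation, and the function names are all made up for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_occurrences(sequence, motif, max_mismatches):
    """Start positions of windows within `max_mismatches` of `motif`."""
    length = len(motif)
    return [
        i
        for i in range(len(sequence) - length + 1)
        if hamming(sequence[i:i + length], motif) <= max_mismatches
    ]


MOTIF = "GACTCGATAGCGACG"  # the 15-mer shown on the slide

# Hypothetical sequence: the motif with one substitution (A -> T at motif position 6),
# flanked by arbitrary background.
sequence = "TTTT" + "GACTCGTTAGCGACG" + "CCCC"

print(find_occurrences(sequence, MOTIF, max_mismatches=1))
print(find_occurrences(sequence, MOTIF, max_mismatches=0))
```

In the actual problem the motif is unknown, so a search like this cannot be run directly; discovery tools must instead find the windows that are mutually similar across sequences.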
Gemoda can solve both the LD-motif problem and a more generalized version of the same
GGGACTCGATAGCGACGCCG
Sequence #1: ATGATGAGTCTATTGCGCCGCGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG...
Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA...
Sequence #3: ATGTACTACGAGTCTCCATAGCGTTGCTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT...
Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGACGACTCGTGGGCGGCG...
...
Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACTGATTCGTTAGGGACGATAGCTACTATCTTATTCGACTAGTACGACT...
Total motif length?
Gemoda can solve both the LD-motif problem and a more generalized version of the same
GACTCGATAGCGACG
X
All sequences?
Sequence #1: ATGATGAGTCTATTGCGCCGCGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG...
Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA...
Sequence #3: ATGTACTACGAGTCTCCATAGCGTTGCTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT...
Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGACGACTCGTGGGCGGCG...
...
Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACTGATTCGTTAGGGACGATAGCTACTATCTTATTCGACTAGTACGACT...
Gemoda can solve both the LD-motif problem and a more generalized version of the same
GACTCGATAGCGACG
Number of mutations?
Sequence #1: ATGATGAGTCTATTGCGCCGCGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG...
Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA...
Sequence #3: ATGTACTACGAGTCTCCATAGCGTTGCTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT...
Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTACGACTATAGCTACTACGACTATAGCTATCTTATTCGACGACTCGTGGGCGGCG...
...
Sequence #m: ATGCTACTATCTTATTCGACTAGTACGACTATAGCTACTGATTCGTTAGGGACGATAGCTACTATCTTATTCGACTAGTACGACT...
Gemoda can solve both the LD-motif problem and a more generalized version of the same
GACTCGATAGCGACG
Sequence #1: ATGATGAGTCTATTGCGCCGCGATCAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGATCTATCTATCAG...
Sequence #2: ATGAGCTAGCTAGCTACTATCTTATTCGACTAGTACGACTACGTACTACGAATCAGCTCGATCGCTAGCTTTTAAATCTCTTCGACTAGCTA...
Sequence #3: ATGTACTACGAGTCTCCATAGCGTTGCTCTATCTATCAGTACTACGACTCGTCGACTAGCTAGCTGACTCTATCTATCAGGATTT...
Sequence #4: ATGACTATAGCTACTATCTTATTCGACTAGTATATCTGGTTCGACTTAGCTATCTATTCGACGACTCGTGGGCGGCG...
...
Sequence #m: ATGCTACTATCTTATTCGACTGAGTACGACTATAGCTACTGATTCGTTAGGGACGATAGCTACTATGACTAGTGACT...
Number of unique motifs?
Gemoda can also be applied to protein structures
Treat protein structure as alpha-carbon trace
  Series of x,y,z coordinates
Use a clustering function that compares x,y,z windows
  Root mean square deviation (RMSD)
  unit-RMSD
(x1, y1, z1), (x2, y2, z2), (x3, y3, z3), ..., (xM, yM, zM)
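Comparing two coordinate windows by RMSD normally requires superimposing them first. To stay short and dependency-free, the sketch below swaps in a related measure, the distance-matrix RMSD (dRMSD), which compares all intra-window distances and is therefore unchanged by rotation or translation. This substitution is my assumption and is not the "unit-RMSD" named on the slide. The coordinates are invented.

```python
import math


def drmsd(window_a, window_b):
    """Root mean square difference between the internal distance matrices of
    two equal-length windows of (x, y, z) points."""
    if len(window_a) != len(window_b):
        raise ValueError("windows must contain the same number of points")
    if len(window_a) < 2:
        raise ValueError("need at least two points per window")
    total = 0.0
    pairs = 0
    for i in range(len(window_a)):
        for j in range(i + 1, len(window_a)):
            d_a = math.dist(window_a[i], window_a[j])
            d_b = math.dist(window_b[i], window_b[j])
            total += (d_a - d_b) ** 2
            pairs += 1
    return math.sqrt(total / pairs)


# Hypothetical four-residue alpha-carbon window.
window = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 3.8)]

# The same shape rotated 90 degrees about the z axis and shifted in space.
moved = [(-y + 10.0, x - 5.0, z + 2.0) for (x, y, z) in window]

# A genuinely different shape: the same points stretched along x.
stretched = [(2.0 * x, y, z) for (x, y, z) in window]

print(drmsd(window, moved))
print(drmsd(window, stretched))
```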
Protein structure example: human FIT vs. uridylyltransferase
fin
language            grammar     linguistic example    biological example
unrestricted        any         ???
context-sensitive   Zb → aX     Dutch                 RNA pseudo-knots
context-free        Z → aXy     Swiss-German          RNA hairpin loop
right-linear        Z → aX      English phonology     ATP-binding motif
05/09/2006, 08:02:46