cap 5510: introduction to bioinformaticsgiri/teach/bioinf/s07/lecx.pdf · cap 5510: introduction to...
TRANSCRIPT
![Page 1: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/1.jpg)
2/8/07 CAP5510 1
CAP 5510: Introduction to Bioinformatics
Giri NarasimhanECS 254; Phone: x3748
[email protected]/~giri/teach/BioinfS07.html
![Page 2: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/2.jpg)
2/8/07 CAP5510 2
Pattern Discovery
![Page 3: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/3.jpg)
2/8/07 CAP5510 3
Patterns
Nature stumbles upon recipes to accomplish tasks.With high probability, such recipes are reused. This causes the recipe to be conserved through evolution. Such recipes give rise to patterns.
![Page 4: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/4.jpg)
2/8/07 CAP5510 4
Why Pattern Discovery?
Modern Biomedical ResearchGenerates a “ton of data”.Use analytical tools to find patterns in data.
Pattern Discovery facilitates this process!Pattern Discovery in sequencesPattern Discovery in structures Pattern Discovery in quantitative data
Patterns help to detect members of a classPatterns help to characterize classes
![Page 5: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/5.jpg)
2/8/07 CAP5510 5
Sequence Patterns: Examples
Protein active sites and functional domainsFor e.g., Zinc-finger motifs & Helix-turn-helix motifs
Protein family signaturesSignals in DNA e.g., protein binding sites MicroRNA and Anti-sense RNA
![Page 6: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/6.jpg)
2/8/07 CAP5510 6
Example 1: Protein Motifs
DNA-binding motifsHelix-turn-Helix
Motifs in Cys2His2-Zinc-binding proteins
Motifs in proteins that bind to [4Fe-4S]-complex
Example: Zinc Finger Motif…YYKCCGLCCERSFFVEKSALLSRHHORVHHKN…
3 6 19 23
Example: Ferredoxin subfamily…CCxxCCxxCCxxxCPCP…
![Page 7: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/7.jpg)
2/8/07 CAP5510 7
How to Represent Patterns
Consensus sequenceAlignmentsLOGO formatFrequency MatricesWeight Matrices (Profiles, PSSMs, PWMs)
![Page 8: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/8.jpg)
2/8/07 CAP5510 8
Pattern Representations
Consensus sequences[Pribnow, 1975]TACGATTATAATTATAATGATACTTATGATTATGTT------TATAAT Consensus
TATRNT Consensusw/ IUPAC
TATAAT Multi-levelG CGC Consensus
T
Needs Alignment
![Page 9: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/9.jpg)
2/8/07 CAP5510 9
Pattern Representations
Consensus sequencesWeight Matrices (Profiles, PSSMs)
Frequency CountsRelative Frequency MeasuresNormalized MeasuresLog-transformed MeasuresInformation content“Logo” techniqueHMMs
![Page 10: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/10.jpg)
2/8/07 CAP5510 10
Pattern Representation: Weight Matrix
[Wasserman, Sandelin,
Nat Genet, 2004]
Alignment
Consensus
Frequencies
Profile/PSSM/PWM
Scoring a sequence
against a profile
Visualizing a profile
![Page 11: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/11.jpg)
2/8/07 CAP5510 11
Formulae
Prob of char b in position i:
Corrected prob:
Weight matrix entry:
Information content of position of i:
Nf
ibp ib,),( =
∑∈
++
=
Αa
ib
asNbsf
ibP)(
)(),( ,
)(),(log2, bBP
ibPW ib =
),(log),(2 2 ibPibPDb
i ∑+=
Frequency
# Sequences
PseudoCount
Background Frequency
[Wasserman, Sandelin,
Nat Genet, 2004]
![Page 12: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/12.jpg)
2/8/07 CAP5510 12
Statistical Evaluation Fundamentals
Probability of finding a sequence w in some position of a DNA/protein sequence (assuming independence at each position)
Pr(wi) = BP(b) [Background Frequency]
)Pr()Pr(1 i
m
iww
=Π=
![Page 13: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/13.jpg)
2/8/07 CAP5510 13
Statistical Evaluation
Z-score of a motif with a certain frequency:
Information Content or Relative Entropy of an alignment or profile:Maximum a Posteriori(MAP) Score:Model Vs BackgroundScore:
)()()()(
wVarwExpwObswz −=
∑∑= =
=4
1 1
,, log)(
i
m
j i
jiji b
mmMIC
∑∑= =
−=4
1 1
,, log)(
i
m
j i
jiji b
mnMMAP
i
jim
j bm
BgwMwwL ,
1)|Pr()|Pr()(
=Π==
![Page 14: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/14.jpg)
2/8/07 CAP5510 14
Pattern Discovery in Protein Sequences
Examples: Helix-Turn-Helix, Zinc-finger, Homeobox domain, Hairpin-beta motif, Calcium-binding motif, Beta-alpha-beta motif,Coiled-coil motifs.
Examples: Helix-Turn-Helix, Zinc-finger, Homeobox domain, Hairpin-beta motif, Calcium-binding motif, Beta-alpha-beta motif,Coiled-coil motifs.
Motifs are combinations of secondary structures in proteins with a specific structure and a specific function.They are also called super-secondary structures.
Motifs are combinations of secondary structures in proteins with a specific structure and a specific function.They are also called super-secondary structures.
Several motifs may combine to form domains. • Serine proteinase domain, Kringle domain, calcium-binding domain, homeobox domain.
Several motifs may combine to form domains. • Serine proteinase domain, Kringle domain, calcium-binding domain, homeobox domain.
![Page 15: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/15.jpg)
2/8/07 CAP5510 15
Motif Detection
Profile MethodIf many examples of the motif are known, then
Training: build a Profile and compute a thresholdTesting: score against profile
Combinatorial Pattern Discovery Methods Gibbs SamplingExpectation MethodHMM
![Page 16: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/16.jpg)
2/8/07 CAP5510 16
How to evaluate these methods?
Calculate TP, FP, TN, FNCompute sensitivity fraction of known sites predicted, specificity, and more.
Sensitivity = TP/(TP+FN)Specificity = TN/(TN+FN)Positive Predictive Value = TP/(TP+FP)Performance Coefficient = TP/(TP+FN+FP)Correlation Coefficient =
![Page 17: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/17.jpg)
2/8/07 CAP5510 17
Motif Detection ProblemMotif Detection Problem
Input:Input: Set, S, of known (aligned) examples of a motif M,A new protein sequence, P.
Output:Output: Does P have a copy of the motif M?
Example: Zinc Finger Motif…YYKCCGLCCERSFFVEKSALLSRHHORVHHKN…
3 6 19 23
Input:Input: Database, D, of known protein sequences,A new protein sequence, P.
Output:Output: What interesting patterns from Dare present in P?
![Page 18: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/18.jpg)
2/8/07 CAP5510 18
Supervised Pattern Discovery
Input: Alignment of known motifs, and Query sequence
Output: Is the query sequence a motif?
Profile Method [Gribskov et al., 1996]Build a profile from the alignment and score query sequence against the profile to decide if it “fits the profile”.Need to pick a threshold score.
Enumerative/Combinatorial Methods
![Page 19: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/19.jpg)
2/8/07 CAP5510 19
Profile HMMs
STATE 1 ENDSTART STATE 2 STATE 3 STATE 4 STATE 5 STATE 6
![Page 20: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/20.jpg)
2/8/07 CAP5510 20
Combinatorial Method: GYMCombinatorial Method: GYM
Pattern Generation: Pattern Generation:
Pattern GeneratorAligned MotifExamples
Pattern DictionaryMotif Detection: Motif Detection:
Motif DetectorNew ProteinSequence
DetectionResults
[Narasimhan, Bu, Wang, Xu, Yang, Mathee, J Comput Biol, 2002]
![Page 21: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/21.jpg)
2/8/07 CAP5510 21
Helix-Turn-Helix MotifsHelix-Turn-Helix Motifs
Branden & Tooze
• Structure• 3-helix complex• Length: 22 amino acids• Turn angle
• Function• Gene regulation by
binding to DNA
![Page 22: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/22.jpg)
2/8/07 CAP5510 22
DNA Binding at HTH MotifDNA Binding at HTH Motif
Branden & Tooze
![Page 23: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/23.jpg)
2/8/07 CAP5510 23
HTH Motifs: ExamplesHTH Motifs: Examples
Loc Helix 2 Turn Helix 3
Protein Name -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
14 Cro F G Q E K T A K D L G V Y Q S A I N K A I H 16 434 Cro M T Q T E L A T K A G V K Q Q S I Q L I E A 11 P22 Cro G T Q R A V A K A L G I S D A A V S Q W K E 31 Rep L S Q E S V A D K M G M G Q S G V G A L F N 16 434 Rep L N Q A E L A Q K V G T T Q Q S I E Q L E N 19 P22 Rep I R Q A A L G K M V G V S N V A I S Q W E R 24 CII L G T E K T A E A V G V D K S Q I S R W K R 4 LacR V T L Y D V A E Y A G V S Y Q T V S R V V N 167 CAP I T R Q E I G Q I V G C S R E T V G R I L K 66 TrpR M S Q R E L K N E L G A G I A T I T R G S N 22 BlaA Pv L N F T K A A L E L Y V T Q G A V S Q Q V R 23 TrpI Ps N S V S Q A A E Q L H V T H G A V S R Q L K
![Page 24: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/24.jpg)
2/8/07 CAP5510 24
Combinatorial Method: GYMCombinatorial Method: GYM
Combinations of residues in specific locations (may not be contiguous) contribute towards stabilizing a structure. Some reinforcing combinations are relatively rare. GYM algorithm is inspired by the APriorialgorithm [Agrawal et al., 1996]
[Narasimhan, Bu, Wang, Xu, Yang, Mathee, J Comput Biol, 2002]
![Page 25: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/25.jpg)
PatternsPatterns
2/8/07 CAP5510 25
Loc Helix 2 Turn Helix 3
Protein Name -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
14 Cro F G Q E K T A K D L G V Y Q S A I N K A I H 16 434 Cro M T Q T E L A T K A G V K Q Q S I Q L I E A 11 P22 Cro G T Q R A V A K A L G I S D A A V S Q W K E 31 Rep L S Q E S V A D K M G M G Q S G V G A L F N 16 434 Rep L N Q A E L A Q K V G T T Q Q S I E Q L E N 19 P22 Rep I R Q A A L G K M V G V S N V A I S Q W E R 24 CII L G T E K T A E A V G V D K S Q I S R W K R 4 LacR V T L Y D V A E Y A G V S Y Q T V S R V V N 167 CAP I T R Q E I G Q I V G C S R E T V G R I L K 66 TrpR M S Q R E L K N E L G A G I A T I T R G S N 22 BlaA Pv L N F T K A A L E L Y V T Q G A V S Q Q V R 23 TrpI Ps N S V S Q A A E Q L H V T H G A V S R Q L K • Q1 G9 N20• A5 G9 V10 I15
![Page 26: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/26.jpg)
Pattern Mining Algorithm Pattern Mining Algorithm
2/8/07 CAP5510 26
Algorithm PatternPattern--MiningMiningInput: Motif length m, support threshold T,
list of aligned motifs M.Output: Dictionary L of frequent patterns.
1. L1 := All frequent patterns of length 1 2. for i = 2 to m do3. Ci := Candidates(Li-1)4. Li := Frequent candidates from Ci5. if (|Li| <= 1) then6. return L as the union of all Lj , j <= i.
![Page 27: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/27.jpg)
Candidates FunctionCandidates Function
2/8/07 CAP5510 27
G1, V2, S3 G1, V2, T6 G1, V2, I7G1, V2, E8G1, S3, T6G1, T6, I7V2, T6, I7V2, T6, E8
L3
G1, V2, S3, T6 G1, V2, S3, I7G1, V2, S3, E8G1, V2, T6, I7G1, V2, T6, E8G1, V2, I7, E8V2, T6, I7, E8
C4
G1, V2, S3, T6 G1, V2, S3, I7G1, V2, S3, E8
G1, V2, T6, E8
V2, T6, I7, E8
L4
![Page 28: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/28.jpg)
Motif Detection AlgorithmMotif Detection Algorithm
2/8/07 CAP5510 28
Algorithm MotifMotif--DetectionDetection
Input : Motif length m, threshold score T, pattern dictionary L,and input protein sequence P[1..n].
Output : Detected motif(s).
1. for each location i do2. S := MatchScore(P[i..i+m-1], L).3. if (S > T) then4. Report it as a possible motif
![Page 29: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/29.jpg)
Experimental Results: GYM 2.0Experimental Results: GYM 2.0
2/8/07 CAP5510 29
Motif Protein Family
Number Tested
GYM = DE Agree
Number Annotated
GYM = Annot.
Master 88 88 (100 %) 13 13 Sigma 314 284 + 23 (98 %) 96 82
Negates 93 86 (92 %) 0 0 LysR 130 127 (98 %) 95 93 AraC 68 57 (84 %) 41 34 Rreg 116 99 (85 %) 57 46
HTH Motif (22)
Total 675 653 + 23 (94 %) 289 255 (88 %)
![Page 30: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/30.jpg)
2/8/07 CAP5510 30
Unaligned Pattern DiscoveryUnaligned Pattern Discovery
Rigoutsos & Floratos, Bioinformatics, ’98
TEIRESIAS: The algorithm is similar to that used in GYM for aligned Pattern discovery.
TEIRESIASProtein SequenceDatabase
Seqlet Dictionary
A..GV
L..H…H
Y.C..C…F
V..G..G.G.T.L•••
![Page 31: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/31.jpg)
2/8/07 CAP5510 31
TEIRESIAS: Key Features
Starts with a set of seed patterns (Enumeration step) Convolution operator applied to all pairs of patterns:
A..GV.S ⊕ V.S.GR = A..GV.S.GROrder of Evaluation carefully chosen so that long patterns get longer firstFinds all maximal patterns.Combinatorial explosion avoided by generating only relevant maximal patterns.
Rigoutsos & Floratos, Bioinformatics, ’98
![Page 32: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/32.jpg)
2/8/07 CAP5510 32
SPLASH
Structural Pattern Localization Analysis by Sequential Histogram (SPLASH)Not limited to fixed alphabet sizePatterns are modeled by a homology metric and thus allow mismatchesEarly pruning of inconsistent seed patterns, leading to increased efficiency. Easily parallelized with availability of extra resources.
Califano, Bioinformatics, ’00; Califano et al., J Comput Biol, ’00
![Page 33: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/33.jpg)
2/8/07 CAP5510 33
Precomputed Sequence Patterns
PROSITEBLOCKS and PRINTSeMOTIFSPATPRODOMPfam
![Page 34: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/34.jpg)
2/8/07 CAP5510 34
Motif Detection ToolsPROSITE (Database of protein families & domains)
Try PDOC00040. Also Try PS00041PRINTS Sample OutputBLOCKS (multiply aligned ungapped segments for highly conserved regions of proteins; automatically created) Sample OutputPfam (Protein families database of alignments & HMMs)
Multiple Alignment, domain architectures, species distribution, links: TryMoSTPROBEProDomDIP
![Page 35: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/35.jpg)
2/8/07 CAP5510 35
Protein Information Sites
SwissPROT & GenBankInterPRO is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences. See sample.
PIR Sample Protein page
![Page 36: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/36.jpg)
2/8/07 CAP5510 36
Modular Nature of Proteins
Proteins are collections of “modular”domains. For example,
F2 E E
EF2
F2
K K
K Catalytic Domain
Catalytic Domain
PLAT
Coagulation Factor XII
![Page 37: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/37.jpg)
2/8/07 CAP5510 37
Domain Architecture Tools
CDARTProtein Domain ArchitectureAAH24495; ;It’s domain relatives; Multiple alignment for 2nd domain
SMART
![Page 38: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/38.jpg)
2/8/07 CAP5510 38
Predicting Specialized Structures
COILS – Predicts coiled coil motifsTMPred – predicts transmembrane regionsSignalP – predicts signal peptidesSEG – predicts nonglobular regions
![Page 39: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/39.jpg)
2/8/07 CAP5510 39
Patterns in DNA Sequences
Signals in DNA sequence control eventsStart and end of genesStart and end of intronsTranscription factor binding sites (regulatory elements)Ribosome binding sites
Detection of these patterns are useful for Understanding gene structureUnderstanding gene regulation
![Page 40: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/40.jpg)
2/8/07 CAP5510 40
Motifs in DNA Sequences
Given a collection of DNA sequences of promoter regions, locate the transcription factor binding sites (also called regulatory elements)
Example:
http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html
![Page 41: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/41.jpg)
2/8/07 CAP5510 41
Motifs
http://weblogo.berkeley.edu/examples.html
![Page 42: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/42.jpg)
2/8/07 CAP5510 42
Motifs in DNA Sequences
http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html
![Page 43: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/43.jpg)
2/8/07 CAP5510 43
More Motifs in E. Coli DNA Sequences
http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html
![Page 44: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/44.jpg)
2/8/07 CAP5510 44
http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html
![Page 45: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/45.jpg)
2/8/07 CAP5510 45
Other Motifs in DNA
Sequences: Human Splice
Junctions
http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html
![Page 46: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/46.jpg)
2/8/07 CAP5510 46
Motifs in DNA Sequences
![Page 47: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/47.jpg)
2/8/07 CAP5510 47
Motif Detection (TFBMs)
See evaluation by Tompa et al.[bio.cs.washington.edu/assessment]
Gibbs Sampling Methods: AlignACE, GLAM, SeSiMCMC, MotifSamplerWeight Matrix Methods: ANN-Spec, Consensus, EM: Improbizer, MEMECombinatorial & Misc.: MITRA, oligo/dyad, QuickScore, Weeder, YMF
![Page 48: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx.pdf · CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; ... Generates a “ton of data”](https://reader031.vdocuments.us/reader031/viewer/2022022520/5b1c610f7f8b9a1b688b9079/html5/thumbnails/48.jpg)
Gibbs Sampling for Motif Detection
2/8/07 CAP5510 48