an introduction to knowledge discovery applications and...

EII PhD Winter School in Data Mining & Bioinformatics, Lorne, 13-17 August 2006

An Introduction to Knowledge DiscoveryApplications and Challenges

in Life Sciences

Limsoon Wong

2

EII PhD Winter School in Data Mining & Bioinformatics, Lorne, 13-17 August 2006 Copyright 2006 © Limsoon Wong

Plan

• Quick introduction to knowledge discovery (13 slides, 20 min)

• Apps and challenges of knowledge discovery– Protein function inference (20 slides, 40 min)

– Gene feature recognition (25 slides, 60 min)

– Protein subcellular localization prediction (20 slides, 40 min)

– Disease diagnosis, treatment, and understanding (40 slides, 80 min)

– Biomedical text mining (35 slides, 50 min)

– Discovery of protein binding motifs (40 slides, 60 min)

• Advancing knowledge discovery (3 slides, 10 min)


Quick Intro to Knowledge Discovery

20 min

4


Jonathan’s rules : Blue or CircleJessica’s rules : All the rest

What is Knowledge Discovery?

Whose block is this?

Jonathan’s blocks

Jessica’s blocks

5


What is Knowledge Discovery?

Question: Can you explain how?

6


Main Steps of Knowledge Discovery

• Training data gathering• Feature generation

– k-grams, colour, texture, domain know-how, ...• Feature selection

– Entropy, χ2, CFS, t-test, domain know-how...• Feature integration

– SVM, ANN, PCL, CART, C4.5, kNN, ...

Someclassifier/methods

7


An Abstract Model of a Classifier

• Given a test sample S• Compute scores p(S), n(S)• Predict S as negative if p(S) < t * n(s)• Predict S as positive if p(S) ≥ t * n(s)

t is the decision threshold of the classifier

8


Neighborhood

5 of class3 of class

=

kNN (k=8)

Image credit: Zaki

9


kNN

• Given a sample S, find the k observations Si in the known data that are “closest” to it, and average their responses

• Assume S is well approximated by its neighbours

p(S) = Σ 1Si ∈Nk(S) ∩ DP

n(S) = Σ 1Si ∈Nk(S) ∩ DN

where Nk(S) is the neighbourhood of Sdefined by the k nearest samples to it.

Assume distance between samples is Euclidean distance for now

10


Curse of Dimensionality

• How much of each dimension is needed to cover a proportion r of total sample space?

• Calculate by ep(r) = r1/p

• So, to cover 1% of a 15-D space, need 85% of each dimension!

00.10.20.30.40.50.60.70.80.9

1

p=3 p=6 p=9 p=12 p=15

r=0.01r=0.1

Exercise: Why ep(r) = r1/p?

11


Consequence of the Curse

• Suppose the number of samples given to us in the total sample space is fixed

• Let the dimension increase

• Then the distance of the k nearest neighbours of any point increases

• Then the k nearest neighbours are less and less useful for prediction, and can confuse the k-NN classifier

12


Tackling the Curse

• Given a sample space of p dimensions

• It is possible that some dimensions are irrelevant

• Need to find ways to separate those dimensions (aka features) that are relevant (aka signals) from those that are irrelevant (aka noise)

13


Signal Selection (Basic Idea)

• Choose a feature w/ low intra-class distance• Choose a feature w/ high inter-class distance

14


Signal Selection (e.g., t-statistics)

15


Self-fulfilling Oracle

• Construct artificial dataset with 100 samples, each with 100,000 randomly generated features and randomly assigned class labels

• select 20 features with the best t-statistics (or other methods)

• Evaluate accuracy by cross validation using only the 20 selected features

• The resultant estimated accuracy can be ~90%

• But the true accuracy should be 50%, as the data were derived randomly

16


What Went Wrong?

• The 20 features were selected from the whole dataset

• Information in the held-out testing samples has thus been “leaked” to the training process

• The correct way is to re-select the 20 features at each fold; better still, use a totally new set of samples for testing


Protein Function Inference

40 min

18


Sequence Alignment

• Key aspect of seqcomparison is seqalignment

• A seq alignment maximizes the number of positions that are in agreement in two sequences

Sequence U

Sequence V

mismatch

match

indel

19


Sequence Alignment: Poor Example

• Poor seq alignment shows few matched positions⇒ The two proteins are not likely to be homologous

No obvious match between Amicyanin and Ascorbate Oxidase

20


Sequence Alignment: Good Example

• Good alignment usually has clusters of extensive matched positions

⇒ The two proteins are likely to be homologous

good match between Amicyanin and unknown M. loti protein

21


Multiple Alignment: An Example

• Multiple seq alignment maximizes number of positions in agreement across several seqs

• seqs belonging to same “family” usually have more conserved positions in a multiple seqalignment

Conserved sites

22


Function Assignment to Protein Seq

• How do we attempt to assign a function to a new protein sequence?

SPSTNRKYPPLPVDKLEEEINRRMADDNKLFREEFNALPACPIQATCEAASKEENKEKNRYVNILPYDHSRVHLTPVEGVPDSDYINASFINGYQEKNKFIAAQGPKEETVNDFWRMIWEQNTATIVMVTNLKERKECKCAQYWPDQGCWTYGNVRVSVEDVTVLVDYTVRKFCIQQVGDVTNRKPQRLITQFHFTSWPDFGVPFTPIGMLKFLKKVKACNPQYAGAIVVHCSAGVGRTGTFVVIDAMLDMMHSERKVDVYGFVSRIRAQRCQMVQTDMQYVFIYQALLEHYLYGDTELEVT

23


Guilt-by-AssociationCompare T with seqs of known function in a db

Assign to T same function as homologs

Confirm with suitable wet experiments

Discard this functionas a candidate

24


BLAST: How It WorksAltschul et al., JMB, 215:403--410, 1990

• BLAST is one of the most popular tool for doing “guilt-by-association” sequence homology search

find from db seqswith short perfectmatches to queryseq

find seqs withgood flanking alignment

25


26


27


28


Homologs obtained by BLAST

• Thus our example sequence could be a protein tyrosine phosphatase α (PTPα)

29


Example Alignment with PTPα

30


Guilt-by-Association: Caveats

• Ensure that the effects of database size and composition have been accounted for

• Ensure that the function of the homology is not derived via invalid “transitive assignment’’

• Ensure that the target sequence has all the key features associated with the function, e.g., active site and/or domain

31


Law of Large Numbers

• Suppose you are in a room with 365 other people

• Q: What is the prob that a specific person in the room has the same birthday as you?

• A: 1/365 = 0.3%

• Q: What is the prob that there is a person in the room having the same birthday as you?

• A: 1 – (364/365)365 = 63%

• Q: What is the prob that there are two persons in the room having the same birthday?

• A: 100%

32


Interpretation of P-value

• Seq. comparison progs, e.g. BLAST, often associate a P-value to each hit

• P-value is interpreted as prob that a random seqhas an equally good alignment

• Suppose the P-value of an alignment is 10-6

• If database has 107 seqs, then you expect 107 * 10-6 = 10 seqs in it that give an equally good alignment

⇒ Need to correct for database size if your seqcomparison prog does not do that!

33


Lightning Does Strike Twice!

• Roy Sullivan, a former park ranger from Virgina, was struck by lightning 7 times– 1942 (lost big-toe nail)– 1969 (lost eyebrows)– 1970 (left shoulder seared)– 1972 (hair set on fire)– 1973 (hair set on fire & legs seared)– 1976 (ankle injured)– 1977 (chest & stomach burned)

• September 1983, he committed suicideCartoon: Ron Hipschman

Data: David Hand

34


Effect of Seq Compositional Bias

• One fourth of all residues in protein seqs occur in regions with biased amino acid composition

• Alignments of two such regions achieves high score purely due to segment composition

• While it is worth noting that two proteins contain similar low complexity regions, they are best excluded when constructing alignments

• BLAST employs the SEG algorithm to filter low complexity regions from proteins before executing a search

Source: NCBI

35


Effect of Sequence Length

Source: Abagyan & Batalov

Distribution of seq identity vs length of unrelated proteins

36


Important Unsolved Challenges

• What if there is no useful sequence homolog?• Guilt by other types of association!

– Domain modeling (e.g., HMMPFAM)– Similarity of dissimilarities (e.g., SVM-PAIRWISE)– Similarity of phylogenetic profiles– Similarity of subcellular co-localization & other

physico-chemico properties(e.g., PROTFUN)– Similarity of gene expression profiles– Similarity of protein-protein interaction partners– …– Fusion of multiple types of info

37


Fusion of Multiple Types of Info

Model each data source into a graph

Combine graphs from diff sources to form larger graph

38


Recommended Readings

• Wong, The Practical Bioinformatician, 2004, ICP. Chapters 10 & 19

• Chua et al., Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics, 2006. (To appear)

• Bateman et al., The Pfam protein families database. NAR, 32:D138-D141, 2004

• Jaakkola et al., A discriminative framework for detecting remote protein homologies. JCB, 7:95-114, 2000

• Jensen et al., Prediction of human protein function from post-translational modifications and localization features. JMB, 319:1257-1265, 2002

• Pellegrini et al., Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. PNAS, 96:4285-4288, 1999


Gene Feature Recognition

60 min

40


...AATGGTACCGATGACCTG... ...TRLRPLLALLALWP......AAUGGUACCGAUGACCUGGAGC...

Central Dogma

41


Translation Initiation Site

42


A Sample cDNA

• What makes the second ATG the TIS?

299 HSU27655.1 CAT U27655 Homo sapiensCGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT............................................................ 80................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

43


Approach

• Training data gathering• Signal generation

– k-grams, distance, domain know-how, ...• Signal selection

– Entropy, χ2, CFS, t-test, domain know-how...• Signal integration

– SVM, ANN, PCL, CART, C4.5, kNN, ...

44


Training & Testing Data

• Vertebrate dataset of Pedersen & Nielsen [ISMB’97]

• 3312 sequences• 13503 ATG sites• 3312 (24.5%) are TIS• 10191 (75.5%) are non-TIS• Use for 3-fold x-validation expts

45


Signal Generation

• K-grams (ie., k consecutive letters)– K = 1, 2, 3, 4, 5, …– Window size vs. fixed position– Up-stream, downstream vs. any where in window– In-frame vs. any frame

0

0.5

1

1.5

2

2.5

3

A C G T

seq1seq2seq3

46


Signal Generation: An Example

• Window = ±100 bases• In-frame, downstream

– GCT = 1, TTT = 1, ATG = 1…• Any-frame, downstream

– GCT = 3, TTT = 2, ATG = 2…• In-frame, upstream

– GCT = 2, TTT = 0, ATG = 0, ...

299 HSU27655.1 CAT U27655 Homo sapiensCGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT

47


Too Many Signals

• For each value of k, there are 4k * 3 * 2 k-grams

• If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!

• This is too many for most machine learning algorithms

48


Signal Selection (Basic Idea)

• Choose a signal w/ low intra-class distance• Choose a signal w/ high inter-class distance

49


Signal Selection (e.g., CFS)

• Instead of scoring individual signals, how about scoring a group of signals as a whole?

• CFS– Correlation-based Feature Selection– A good group contains signals that are highly

correlated with the class, and yet uncorrelated with each other

50


Sample K-grams Selected by CFS

• Position –3• in-frame upstream ATG• in-frame downstream

– TAA, TAG, TGA, – CTG, GAC, GAG, and GCC

Kozak consensus Leaky scanning

Stop codon

Codon bias?

51


Signal Integration

• kNN– Given a test sample, find the k training samples

that are most similar to it. Let the majority class win

• SVM– Given a group of training samples from two

classes, determine a separating plane that maximises the margin of error

• Naïve Bayes, ANN, C4.5, ...

52


Results (3-fold x-validation)

TP/(TP + FN) TN/(TN + FP) TP/(TP + FP) Accuracy

Naïve Bayes 84.3% 86.1% 66.3% 85.7%

SVM 73.9% 93.2% 77.9% 88.5%

Neural Network 77.6% 93.2% 78.8% 89.4%

Decision Tree 74.0% 94.4% 81.1% 89.4%

53


Improvement by Scanning

• Apply Naïve Bayes or SVM left-to-right until first ATG predicted as positive. That’s the TIS

• Naïve Bayes & SVM models were trained using TIS vs. Up-stream ATG

TP/(TP + FN) TN/(TN + FP) TP/(TP + FP) Accuracy

NB 84.3% 86.1% 66.3% 85.7%

SVM 73.9% 93.2% 77.9% 88.5%

NB+Scanning 87.3% 96.1% 87.9% 93.9%

SVM+Scanning 88.5% 96.3% 88.6% 94.4%

54


mRNA→Protein

F

L

I

MV

S

P

T

A

Y

H

Q

N

K

D

E

C

WR

G

AT

E

LR

S

stop

How about using k-grams from the translation?

55


Amino-Acid Features

56


Amino-Acid Features

57


Amino Acid K-grams Discovered (by Entropy)

58


Independent Validation Sets

• A. Hatzigeorgiou:– 480 fully sequenced human cDNAs– 188 left after eliminating sequences similar to

training set (Pedersen & Nielsen’s)– 3.42% of ATGs are TIS

• Our own:– well characterized human gene sequences from

chromosome X (565 TIS) and chromosome 21 (180 TIS)

59


Validation Results (on Hatzigeorgiou’s)

– Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s dataset

60


ATGpr

Ourmethod

Validation Results (on Chr X and Chr 21)

• Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s

61



• Other gene features: TSS, PolyA, Splicing, …

• Gene regulation: TFBS, EREs, miRNA targets, …

• “Dark matters”: miRNA & non-coding genes

62


TSS: Is State of the Art Good Enough?

63


TFs and TFBSs

• Recognize TFBS– Short & degenerate– Incomplete erroneous data

• Characterize type of regulation, e.g., suppressor vs enhancer

• Identify synergistic TF pairs– Genes that are coordinately

bound are coordinately expressed

– Requires data over many time points

– TFs may interact only under certain conditions

• A method to identify synergistic TF pairs in cell cycle– Study if a pair of TFs are

assoc with the same genes more often than random

– For each significant pair, test whether there is a phase in cell cycle where the common associated genes are significantly diff in other cell cycle phases

⇒ Quite primitive. Lots more work needed!

64


miRNA: New Frontier in Molecular Biology

• Impt functions of miRNA: – Control of cell fate and fat

metabolism in flies– Neuronal patterning in

nematodes– Modulation of

hematopoietic lineage differentiation in mammals

– control of leaf and flower development in plants

• By regulating gene expr in binding targets:– Repression/inhibition of

translation– Degradation of mRNA

• miRNA identification thru direct cloning has limitations:– miRNAs may be expressed

only in certain tissues or at certain times

– Expr levels vary greatly – Degradation products from

mRNAs and other endogenous non-coding RNAs coexist w/ miRNAsand are sometimes dominant in small RNA molecule samples extracted from cells

65



• Wong, The practical Bioinformatician, 2004, ICP. Chapters 4, 5, 6, & 7

• Liu & Wong, Data mining tools for biological sequences. JBCB, 1:139-168, 2003

• Tsai et al., Statistical methods for identifying yeast cell cycle transcription factors. PNAS, 102:13532-13537, 2005

• Yang et al., Identification of microRNA precursors via SVM. APBC 2006, pages 267-276


Protein SubcellularLocalization Prediction

40 min

67


Compartments and Sorting

• Eukaryotic cells requires proteins be targeted to their subcellulardestinations

• Protein sorting is determined by specific amino acid sequences, or “signals”, within the protein

• Secretory pathway targets proteins to plasma membrane, some membrane-bound organelles such as lysosomes, or to export proteins from the cell

68


Datasets

• Reinhartdt & Hubbard, NAR, 26:2230--2236, 1998– 2427 eukaryotic proteins for 4 locations (cytoplasmic, extracellular,

nuclear,& mitochondrial)– 997 prokaryotic proteins for 3 locations (cytoplasmic, extracellular, &

periplasmic)

• Park & Kanehisa, Bioinformatics, 19:1656--1663, 2003

– 7589 eukaryotic proteins from 709 organisms for 12 locations (chloroplast, cytoplasmic, cytoskeleton, ER, extracellular, golgi, lysosomal, mitochondrial, nuclear, peroxisomal, plasma membrane, vacuolar)

• Chou & Cai, JBC, 277:45765--45769, 2002– 2191 proteins for 12 locations

• Emanuelsson et al., JMB, 300:1005--1016, 2000• Gardy et al., NAR, 31:3613--3617, 2003

69


Common Eukaryotic Protein Sorting Signals

For a comprehensive list of cellular localization sites, seehttp://mendel.imp.univie.ac.at/CELL_LOC/index.html

70


Schematic View of SortingSignals

cleavage site

~25aa

71


Sequence Logos ofSP, mTP, & cTP

SPsignal peptide

mTPmitochondrial

transfer peptide

cTPchloroplast

transit peptide

72


Expert System Approach: PSORT Horton & Nakai, ISMB, 1997

A simplified version of the decision tree thatPSORT uses tocheck and reasonover various sorting signals

73


Amino acid composition of

proteins residing in

different sites are different

74


Amino Acid Composition Differences

• each cellular location has own characteristic physio-chemical environment

• proteins in each location have adapted thru evolution to that environment

• thus reflected in the protein structure and amino acid composition

• If the above is true, the amino acid composition differences wrt cellular location sites should be more pronounced on protein surfaces than protein interior

• Exercise: Why?

75


Adaptation of Protein SurfacesAndrade et al., JMB, 1998

• To test the theory of adaptation of protein surfaces to subcellularlocalization, we do a plot of 3 types of composition vectors along their first two principal components

Proportion ofjth amino acid type in ith protein

76


Adaptation of Protein Surfaces Andrade et al., JMB, 1998

• Clearly total & surface composition vectors show better separation than interior composition vectors

Total amino acidcomposition vector

Interior amino acidcomposition vector

Surface amino acidcomposition vector

nuclearcytoextracell

77


Amino Acid Composition

• This means can use amino acid composition vectors, especially those from protein surfaces, to predict subcellular localization!

• Let’s see how this turn out….

78


Neural Networks: NNPSLReinhardt & Hubbard, NAR, 26:2230--2236, 1998

Input1

Input20

cytoplasmic

extracellular

mitochodrial

nuclear

fraction of each aminoacid in the input protein

79


NNPSL:Performance

• Outputs of NNPSL have values 0 to 1. The difference (Δ) between the highest and the next highest nodes can be used as a reliability index

0 < Δ < 0.2

0.2 < Δ < 0.4

0.4 < Δ < 0.6

0.6 < Δ < 0.8

0.8 < Δ < 1

Dataset: Reinhardt & Hubbard,NAR, 1998

80


Support Vector Machines: SubLocHua & Sun, Bioinformatics, 17:721--728, 2001

extracellularvs rest

nuclearvs rest

cytoplasmicvs rest

mitochondrialvs rest

ArgmaxX X-vs-rest

SVM

SVM

SVM

SVMThe SVMs use • polynomial kernel with d = 9 (prokaryotic),

K(Xi,Xj) = (Xi ·Xj + 1)d

• RBF kernel with γ=16 (eukaryotic),K(Xi, Xj) = exp(- γ |Xi - Xj|2)

20-dimensional vector giving amino

acid composition of the input protein

81


SubLoc:

PerformanceNNPSL SubLoc

(Eukaryotic)

Dataset: Reinhardt & Hubbard, NAR, 1998

82


SubLoc: Robustness of Amino Acid Composition Approach

• Amazingly, accuracy of SubLoc is virtually unaffected when the first 10, 20, 30, & 40 amino acids in a protein are deleted

• Amino acid composition is a robust indicator of subcellularlocalization, and is insensitive to errors in N-terminal sequences

83


Amino Acid Composition:Taking it Further

• How about pairs of consecutive amino acids? (a.k.a 2-grams) How about 3-grams, …, k-grams?

• How about pseudo amino acid composition?• How about presence of entire functional

domains? (I.e. think of the presence/absence of a functional domain as a summary of amino acid sequence info...)

84


Functional Domain CompositionCai & Chou, BBRC, 305:407--411, 2003

Training seqs of various localizationsites

BLAST againstdb of known functional domains(Interpro)

NN-5875D:Train k-NN (k=1) using these vectors

or, if nohit found

Pseudo aminoacid composition

Aminoacidcomposition

NN-40D:Train k-NN (k=1) using these vectors

If a protein got a hit in Interpro,use NN-5875D; else use NN-40D

85


Functional Domain Composition:

Performance

Dataset: Reinhardt & Hubbard, NAR, 1998

86


Recommended Readings• Horton & Nakai, “Better prediction of protein cellular localization sites

with the k-nearest neighbours classifier”, ISMB, 5:147--152, 1997

• Gardy et al., “PSORT-B: Improving protein subcellular localization for Gram-negative bacteria”, NAR, 31:3613--3617, 2003

• Emanuelsson, “Predicting protein subcellular localization from amino acid sequence information”, BIB, 3:361--376, 2002

• Andrade et al., “Adaptation of protein surfaces to subcellular location”, JMB, 276:517--525, 1998

• Yuan, “Prediction of protein subcellular locations using Markov chain models”, FEBS Letters, 451:23--26, 1999

• Hua & Sun, “Support vector machine approach for protein subcellularlocalization prediction”, Bioinformatics, 17:721--728, 2001

• Park & Kanehisa, “Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs”, Bioinformatics, 19:1656--1663, 2003


Disease Diagnosis, Treatment, & Understanding

80 min

88


• The subtypes look similar

• Conventional diagnosis– Immunophenotyping– Cytogenetics– Molecular diagnostics

Childhood Acute Lymphoblastic Leukemia

• Major subtypes: T-ALL, E2A-PBX, TEL-AML, BCR-ABL, MLL genome rearrangements, Hyperdiploid>50

• Diff subtypes respond differently to same Tx

• Over-intensive Tx– Development of

secondary cancers– Reduction of IQ

• Under-intensiveTx– Relapse

89


Childhood ALL Cure Rates

• Conventional risk assignment procedure requires difficult expensive tests and collective judgement of multiple specialists

⇒ Not available in less advanced ASEAN countries

⇒ Can we have a single-test easy-to-use platform instead?0% 20% 40% 60% 80%

singapore

malaysia

indonesia

philippines

thailand

vietnam

cambodia cure rate

90


Single-Test Easy-to-Use Platform

91


A Sample Affymetrix GeneChipData File (U95A)

92


Overall Strategy

• For each subtype, select genes to develop classification model for diagnosing that subtype

• For each subtype, select genes to develop prediction model for prognosis of that subtype

Diagnosis of subtype

Subtype-dependentprognosis

Risk-stratifiedtreatmentintensity

93


Subtype Diagnosis by PCL

• Gene expression data collection

• Gene selection by χ2

• Classifier training by emerging pattern

• Classifier tuning (optional for some machine learning methods)

• Apply classifier for diagnosis of future cases by PCL

94


Childhood ALL Subtype Diagnosis Workflow

A tree-structureddiagnostic workflow was recommended byour doctor collaborator

95


Training and Testing Sets

96


Signal Selection Basic Idea

• Choose a signal w/ low intra-class distance• Choose a signal w/ high inter-class distance

97


Signal Selection by χ2

98


Emerging Patterns

• An emerging pattern is a set of conditions– usually involving several features– that most members of a class satisfy – but none or few of the other class satisfy

• A jumping emerging pattern is an emerging pattern that – some members of a class satisfy– but no members of the other class satisfy

• We use only jumping emerging patterns

99


Examples

Reference number 9: the expression of gene 37720_at > 215Reference number 36: the expression of gene 38028_at <= 12

Patterns Frequency (P) Frequency(N){9, 36} 38 instances 0{9, 23} 38 0{4, 9} 38 0{9, 14} 38 0{6, 9} 38 0{7, 21} 0 36{7, 11} 0 35{7, 43} 0 35{7, 39} 0 34{24, 29} 0 34

Easy interpretation

100


PCL: Prediction by Collective Likelihood

101


PCL Learning

Top-Ranked EPs inPositive class

Top-Ranked EPs inNegative class

EP1P (90%)

EP2P (86%)

.

.EPn

P (68%)

EP1N (100%)

EP2N (95%)

.

.EPn

N (80%)

The idea of summarizing multiple top-ranked EPs is intendedto avoid some rare tie cases

102


PCL Testing

ScoreP = EP1P’ / EP1

P + … + EPkP’ / EPk

P

Most freq EP of pos classin the test sample

Most freq EP of pos class

Similarly, ScoreN = EP1

N’ / EP1N + … + EPk

N’ / EPkN

If ScoreP > ScoreN, then positive class, Otherwise negative class

103


Accuracy of PCL (vs. other classifiers)

The classifiers are all applied to the 20 genes selected by χ2 at each level of the tree

104


Understandability of PCL

• E.g., for T-ALL vs. OTHERS, one ideally discriminatory gene 38319_at was found, inducing these 2 EPs

• These give us the diagnostic rule

105


Multidimensional Scaling Plot for Subtype Diagnosis

Obtained by performing PCA on the 20 genes chosen for each level

106


ImpactConventional Tx:• intermediate intensity to everyone⇒ 10% suffers relapse⇒ 50% suffers side effects⇒ costs US$150m/yr

Our optimized Tx:• high intensity to 10%• intermediate intensity to 40%• low intensity to 50%• costs US$100m/yr

•High cure rate of 80%• Less relapse

• Less side effects• Save US$51.6m/yr

107


Asian Innovation Gold Award, Far Eastern Economic Review, November 2003

108


Is there a new subtype?

• Hierarchical clustering of gene expression profiles reveals a novel subtype of childhood ALL

109


Hierarchical Clustering

• Assign each item to its own cluster– If there are N items initially, we get N clusters,

each containing just one item• Find the “most similar” pair of clusters, merge

them into a single cluster, so we now have one less cluster – “Similarity” is often defined using

• Single linkage• Complete linkage• Average linkage

• Repeat previous step until all items are clustered into a single cluster of size N

110


Single, Complete, & Average Linkage

Single linkage defines distancebetw two clusters as min distancebetw them

Complete linkage defines distancebetw two clusters as max distance betwthem

Exercise: Give definition of “average linkage”

Image source: UCL Microcore Website


Selection of Patient Samples and Genes for

Disease Prognosis

112


Gene Expression Profile + Clinical Data

⇒ Outcome Prediction

• Univariate & multivariate Cox survival analysis (Beer et al 2002, Rosenwald et al 2002)

• Fuzzy neural network (Ando et al 2002)

• Partial least squares regression (Park et al 2002)

• Weighted voting algorithm (Shipp et al 2002)

• Gene index and “reference gene” (LeBlanc et al 2003)

• ……

113


Another Approach …

“extreme”sampleselection

ERCOF

114


Short-term Survivors v.s. Long-term Survivors

T: sampleF(T): follow-up time

E(T): status (1:unfavorable; 0: favorable)c1 and c2: thresholds of survival time

Short-term survivorswho died within a short

period

F(T) < c1 and E(T) = 1

⇓

Long-term survivorswho were alive after a

long follow-up time

F(T) > c2

⇓

Extreme Sample Selection

115


ERCOFEntropy-

Based Rank Sum Test & Correlation

Filtering

Remove genes with expression values w/o cut point found (can’t be discretized)

Calculate Wilcoxonrank sum w(x) for gene x. Remove gene x if w(x)∈ [clower, cupper]

Group features by Pearson Correlation For each group, retain the top 50% wrt class entropy

116


Linear Kernel SVM regression functionbixTKyaTG i

ii +=∑ ))(,()(

T: test sample, x(i): support vector,yi: class label (1: short-term survivors; -1: long-term survivors)

Transformation function (posterior probability)

)(11)( TGe

TS −+= ))1,0()(( ∈TS

S(T): risk score of sample T

Risk Score Construction

117


Diffuse Large B-Cell Lymphoma

• DLBC lymphoma is the most common type of lymphoma in adults

• Can be cured by anthracycline-based chemotherapy in 35 to 40 percent of patients

⇒ DLBC lymphoma comprises several diseases that differ in responsiveness to chemotherapy

• Intl Prognostic Index (IPI) – age, “Eastern Cooperative

Oncology Group” Performance status, tumor stage, lactate dehydrogenase level, sites of extranodal disease, ...

• Not very good for stratifying DLBC lymphoma patients for therapeutic trials

⇒ Use gene-expression profiles to predict outcome of chemotherapy?

118


Rosenwald et al., NEJM 2002

• 240 data samples– 160 in preliminary group– 80 in validation group– each sample described by 7399 microarray

features• Rosenwald et al.’s approach

– identify gene: Cox proportional-hazards model– cluster identified genes into four gene signatures– calculate for each sample an outcome-predictor

score– divide patients into quartiles according to score

119


Knowledge Discovery from Gene Expression of “Extreme” Samples

“extreme”sampleselection:< 1 yr vs > 8 yrs

knowledgediscovery from gene expression

240 samples

80 samples26 long-

term survivors

47 short-term survivors

7399genes

84genes

T is long-term if S(T) < 0.3T is short-term if S(T) > 0.7

120


732547+1(*)Informative

1607288OriginalDLBCL

AliveDead

TotalStatusData setApplication

Number of samples in original data and selected informative training set.(*): Number of samples whose corresponding patient was dead at the end of follow-up time, but selected as a long-term survivor.

Discussions: Sample Selection

121


84(1.7%)Phase II

132(2.7%)Phase I

4937(*)Original

DLBCLGene selection

Number of genes left after feature filtering for each phase.(*): number of genes after removing those genes who were absent in more than 10% of the experiments.

Discussions: Gene Identification

122


p-value of log-rank test: < 0.0001Risk score thresholds: 0.7, 0.3

Kaplan-Meier Plot for 80 Test Cases

123


(A) IPI low, p-value = 0.0063

(B) IPI intermediate,p-value = 0.0003

Improvement Over IPI

124


(A) W/o sample selection (p =0.38) (B) With sample selection (p=0.009)

No clear difference on the overall survival of the 80 samples in the validation group of DLBCL study, if no training sample selection conducted

Merit of “Extreme” Samples

125


Is ERCOF Useful? Observations from 1000+ Expts

• Feature selection methods considered– All use all features

– All-entropy select features whose value range can be partitioned by Fayyad & Irani’sentropy method

– Mean-entropy select features whose entropy is better than the mean entropy

– Top-number-entropy select the top 20, 50, 100, 200 genes by their entropy

– ERCOF at 5% significant level for Wilcoxon rank sum test and 0.99 Pearson correlation coeff threshold

• Data sets considered– Colon tumor– Prostate cancer– Lung cancer– Ovarian cancer– DLBC lymphoma– ALL-AML– Childhood ALL

• Learning methods considered– C4.5– Bagging, Boosting, CS4– SVM, 3-NN

126


ERCOF vs All-Entropy

All-entropywins 4 times

ERCOF wins 60 times

127


ERCOF vs Mean-Entropy

Mean-entropywins 18 times

ERCOF wins 42 times

128


Effectiveness of ERCOF

22Total wins 38 46 47 48 51 47 67

129


Conclusions

• Selecting extreme cases as training samples is an effective way to improve patient outcome prediction based on gene expression profiles and clinical information

• ERCOF is very suitable for SVM, 3-NN, CS4, Random Forest, as it gives these learning algoshighest no. of wins

• ERCOF is suitable for Bagging also, as it gives this classifier the lowest no. of errors

⇒ ERCOF is a systematic feature selection method that is very useful

130


Beyond Classification of Gene Expression Profiles

• After identifying the candidate genes by feature selection, do we know which ones are causal genes and which ones are surrogates?

Diagnostic ALL BM samples (n=327)

3σ-3σ -2σ -1σ 0 1σ 2σσ = std deviation from mean

Gen

es fo

r cla

ss

dist

inct

ion

(n=2

71)

TEL-AML1BCR-ABL

Hyperdiploid >50E2A-PBX1

MLL T-ALL Novel

131


Gene Regulatory Circuits

• Genes are “connected”in “circuit” or network

• Expr of a gene in a network depends on expr of some other genes in the network

• Can we “reconstruct”the gene network from gene expression and other data? Source: Miltenyi Biotec

132



• Wong, The Practical Bioinformatician, 2004, ICP. Chapter 14• Li & Wong, Identifying good diagnostic genes or gene groups

from gene expression data by using the concept of emerging patterns. Bioinformatics, 18:725-734, 2002

• Miller et al., Optimal gene expression analysis by microarrays. Cancer Cell, 2:353-361, 2002

• Liu et al,. Use of Extreme Patient Samples for Outcome Prediction from Gene Expression Data. Bioinformatics, 21:3377-3384, 2005

• Eisen et al., Cluster analysis and display of genome-wide expression patterns. PNAS, 8:14863-14868, 1998

• Tibshirani et al., Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS, 99:6567-6572, 2002


Biomedical Text Mining

50 min

134


Text Mining of Biomedical Literature

• Knowledge discovery system for processing free texts in scientific abstracts

• Specialized in– Extracting

protein names– Extracting

protein-protein interactions

– Etc …

135


Main Steps of Text Mining

136


Bio Entity Name Recognition, Zhou et al., BioCreAtIvE, 2004

Adapted w/ permission from Jian Su

137


Bio Entity Name Recognition:

Ensemble Classification Approach• Features considered

– orthographic, POS, morphologic, surface word, trigger words (TW1: receptor, enhancer, etc. TW2: activation, stimulation, etc.)

• SVM– Context of 7 words– Each word gives 5 features,

plus its position• HMMs

– 3 features used (orthographic, POS, surface word)

– HMM1 & HMM2 use POS taggers trained on diff corpora

HMM1balanced precision

& recall

HMM2low precision& high recall

SVMhigh precision& low recall

MajorityVoting

138


• h1, h2, h3 are indep classifiers w/ accuracy = 60%• C1, C2 are the only classes• t is a test instance in C1

• h(t) = argmaxC∈{C1,C2} |{hj ∈{h1, h2, h3} | hj(t) = C}|• Then prob(h(t) = C1)

= prob(h1(t)=C1 & h2(t)=C1 & h3(t)=C1) +prob(h1(t)=C1 & h2(t)=C1 & h3(t)=C2) +prob(h1(t)=C1 & h2(t)=C2 & h3(t)=C1) +prob(h1(t)=C2 & h2(t)=C1 & h3(t)=C1)

= 60% * 60% * 60% + 60% * 60% * 40% +60% * 40% * 60% + 40% * 60% * 60% = 64.8%

Why Ensemble?

139


Bio Entity Name Recognition:

Performance at BioCreAtIvE 2004

Adapted w/ permission from Zhou et al., BioCreAtIve 2004

140


Co-Reference Resolution, Yang et al., IJCNLP, 2004

Adapted w/ permission from Yang et al., IJCNLP 2004

141


Co-Reference Resolution:

Baseline Features Used


142


Co-Reference Resolution:

New Features Used & Performance


Base Classifier:

C5.0

143


• The max entropy model:

• where– o is the outcome– h is the feature vector– Z(h) is normalization

function– fj are feature functions– αj are feature weights

Protein Interaction Extraction, Xiao et al, IJCNLP 2004

Adapted w/ permission from Xiao et al., IJCNLP 2004

144


Protein Interaction Extraction:

Features Used

Adapted w/ permission from Xiao et al.

145


Protein Interaction Extraction:

Performance on IEPA Corpus

Adapted w/ permission from Xiao et al.

IEPA = Interaction Extraction Performance Assessment

146


Issues in Constructing Gene/Protein Networks

• Sound:– Is the contents of our databases correct?

• Complete:– Is the structure of our databases expressive

enough to capture critical information explicitly? • Understandable:

– Is our databases or search results understandable?

147

EII PhD Winter School in Data Mining & Bioinformatics, Lorne, 13-17 August 2006 Copyright 2006 © Limsoon WongAdapted w/ permission from Judice Koh, Mong Li Lee, & Vladimir Brusic

Soundness:Is the content of our databases correct?

• 11 types & 28 subtypes of data artifacts– Critical artifacts (vector

contaminated sequences, duplicates, sequence structure violations)

– Non-critical artifacts (misspellings, synonyms)

• > 20,000 seq records in public contain artifacts

• Identification of these artifacts are impt for accurate knowledge discovery

• Sources of artifacts– Diverse sources of data

• Repeated submissions of seqs to db’s• Cross-updating of db’s

– Data Annotation• Db’s have diff ways for data

annotation• Data entry errors can be introduced• Different interpretations

– Lack of standardized nomenclature

• Variations in naming• Synonyms, homonyms, & abbrevn

– Inadequacy of data quality control mechanisms

• Systematic approaches to data cleaning are lacking

148


Classification of Errors,

Koh et al, DBiDB, 2005

Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

149


Context of the misspellingsCorrectionsMisspellings

EMBL:Y18050 E.faecium pbp5 geneTITLE Modification of penicillin-binding protein 5 asociated with highlevel ampicillin resistance in Enterococcus faeciumgi|1143442|emb|X92687.1|EFPBP5G

associatedasociated

Swiss-Prot:P03385Env polyprotein precursor DEFINITION Env polyprotein precursor [Contains: Surface protein (SU) (GP70);Tranmembrane protein (TM) (p15E); R protein].gi|119478|sp|P03385|ENV_MLVMO

transmembranetranmembrane

Patent Database:A76783 Sequence 11 from Patent WO9315210CDS <1..150/note="gene cassete encoding intercalating jun-zipper andlinker"gi|6088638|emb|A76783.1||pat|WO|9315210|11[6088638]

CassetteCassete

GenBank:AAD26534 nectin-1 [Rattus norvegicus]TITLE Nectin/PRR: An Immunogloblin-like Cell Adhesion Molecule Recruitedto Cadherin-based Adherens Junctions through Interaction withAfadin, a PDZ Domain-containing Proteingi|4590334|gb|AAD26534.1

ImmunoglobulinImmunoglobinRECORD

SINGLE SOURCE DATABASE

Invalid values

Ambiguity

Incompatible schema

ATTRIBUTE

Spelling errorsFormat violation

Annotation error

Dubious sequences

Sequence redundancy

Data Provenanceflaws

Cross-annotation error

Sequence structure violation

• Usually typo errors

• Occurs in different fields of the record

• We identified 569 possible misspelled words affecting up to 20,505 nucleotide records in Entrez.

Vector contaminated sequence

Erroneous data transformation

MULTIPLESOURCE DATABASE

Spelling Errors

Copyright © 2006 by Limsoon Wong. Adapted w/permision from Judice Koh, Mong Li Lee, & Vladimir Brusic

150


RECORD


Invalid values

Ambiguity

Incompatible schema

ATTRIBUTE

Uninformative sequences

Undersized sequences

Annotation error

Dubious sequences

Sequence redundancy

Data provenance flaws






Among the 5,146,255 protein records queried using Entrez to the major protein or translated nucleotide databases , 3,327 protein sequences are shorter than four residues (as of Sep, 2004).

• In Nov 2004, the total number of undersized protein sequences increases to 3,350.

• Among 43,026,887 nucleotide records queried using Entrez to major nucleotide databases, 1,448 records contain sequences shorter than six bases (as of Sep, 2004).

• In Nov 2004, the total number of undersized nucleotide sequences increases to 1,711.

Undersized protein sequences in major databases

218 17142

528

116 151

1015

364 383

12351

1253 2 120 0 23

0

200

400

600

800

1000

1200

1 2 3

Sequence Length

Num

ber o

f rec

ords DDBJ

EMBL

GenBank

PDB

SwissProt

PIR

Undersized nucleotide sequences in major databases

2 3 924

5573 69

11581

233228

108 108

5167

640 45

77104

0

50

100

150

200

250

1 2 3 4 5

Sequence LengthNu

mbe

r of r

ecor

ds

DDBJEMBLGenBankPDB


Meaningless Seqs

151


RECORD


Invalid values

Ambiguity

Incompatible schema

ATTRIBUTE

Overlapping intron/exon

Annotation error

Dubious sequences

Sequence redundancy

Data Provenanceflaws






Syn7 gene of putative polyketide synthase in NCBI TPA record BN000507 has overlapping intron 5 and exon 6.

rpb7+ RNA polymerase II subunit in GENBANK record AF055916 has overlapping exon 1 and exon 2.

Overlapping Intron/Exons


152


RECORD


Invalid values

Ambiguity

Incompatible schema

ATTRIBUTE

Annotation error

Dubious sequences

Sequence redundancy

Data provenance flaws






Replication of sequence informationDifferent views

Overlapping annotations of the same sequence

• Submission of the same sequence to different databases

• Repeated submission of the same sequence to the same database

• Initially submitted by different groups

• Protein sequences may be translated from duplicate nucleotide sequences

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11692005&dopt=GenPept

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11692005&dopt=GenPept

Seqs w/ Identical Info


153


SINGLE-SOURCE DATABASE

ATTRIBUTE

Schema remapping

Sequence Structure Parser

RECORD

Concatenated values

Spelling errors

Format violation

Synonyms

Homonyms/Abbreviations

Misuse of fields

Features do not correspond with sequence

Dictionary lookup

Integrity constraints

Undersized sequences

Uninformative sequences

Replication of sequence information

Different views

Overlapping annotations of the same sequence

Fragments

Duplicate detection

Mis-fielded values

Comparative analysis


MULTI-SOURCEDATABASE

Vector contaminated sequencesVector screening

Putative features


Trying our hands on

data cleansing problems

154


Association Rule Mining for De-

duplication

Compute similarity scores from known duplicate pairs

Select matching criteria

Generate association rules

Detect duplicates using the rules

155


Scorpion venom datasetcontaining 520 records

695 duplicate pairs are collectively identified.

Snake PLA2 venom dataset containing 780 records

Entrez (GenBank, GenPept, SwissProt, DDBJ, PIR, PDB)

251 duplicate pairs 444 duplicate pairs

scorpion AND (venom OR toxin) serpentes AND venom AND PLA2

Expert annotation

Dataset

156


Features to Match


157


AAG39642 AAG39643 AC0.9 LE1.0 DE1.0 DB1 SP1 RF1.0 PD0 FT1.0 SQ1.0

AAG39642 Q9GNG8 AC0.1 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT0.1 SQ1.0

P00599 PSNJ1W AC0.2 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0

P01486 NTSREB AC0.0 LE1.0 DE0.3 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0

O57385 CAA11159 AC0.1 LE1.0 DE0.5 DB0 SP1 RF0.0 PD0 FT0.1 SQ1.0

S32792 P24663 AC0.0 LE1.0 DE0.4 DB0 SP1 RF0.5 PD0 FT1.0 SQ1.0

P45629 S53330 AC0.0 LE1.0 DE0.2 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0

LE1.0 PD0 SQ1.0 (99.7%)SP1 PD0 SQ1.0 (97.1%)SP1 LE1.0 PD0 SQ1.0 (96.8%)DB0 PD0 SQ1.0 (93.1%)DB0 LE1.0 PD0 SQ1.0 (92.8%)DB0 SP1 PD0 SQ1.0 (90.4%)DB0 SP1 LE1.0 PD0 SQ1.0 (90.1%)RF1.0 SP1 LE1.0 PD0 SQ1.0 (47.6%)RF1.0 DB0 LE1.0 PD0 SQ1.0 (44.0%)

Similarity scores of known

duplicate pairs

Association rule mining

Frequent item-set with support

Association Rule Mining

158


Duplicates detected by association rules

6

36.3

0.3

49.4

5.2

32.7

0.11.8 2.4 3.8 5.7 7.5 7.9 9.4

0

10

20

30

40

50

60

Rule 1Rule 2

Rule 3Rule 4

Rule 5Rule 6

Rule 7

Association rulesFP

% a

nd F

N%

FP% FN% x 1000

S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.1%)Rule 7

S(Seq)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.4%)Rule 6

S(Seq)=1 ^ M(Seq Length)=1 ^ M(PDB)=0 ^ M(DB)=0 (92.8%)Rule 5

S(Seq)=1^ M(PDB)=0 ^ M(DB)=0 (93.1%)Rule 4

S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%)Rule 3

S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%)Rule 2

S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%)Rule 1

Results


Rule 1. Identical sequences with the same sequence length and not originated from PDB are 99.7% likely to be duplicates.

Rule 2. Identical sequences with the same sequence length and of the same species are 97.1% likely to be duplicates.

Rule 3. Identical sequences with the same sequence length, of the same species and not originated from PDB are 96.8% likely to be duplicates.

159


Completeness:Is the structure of our db expressive

enough to capture critical info explicitly?

• Take a key paper such as the Kohn paper that summarises current knowledge on p53 regulation

• Is there a structured database that is able to capture all info in that paper explicitly?

• Is there a semi-structured database that is able to capture all info in that paper explicitly?

• How well does this (semi-) structured database generalize to other similar type of papers?

160


Understandability:Is our db or search results understandable?• Take a search on p53. You will get >300k hits or

some number like that on MEDLINE• It is not feasible for anyone to go thru all of that

to find what he wants! And this problem is growing bigger as MEDLINE doubles every 1-2 year.

• Need to organize the database and/or the search results into hierarchy or “semantic” net to make it easier for users to understand or to browse the results

• How do we define this hierarchy/net? • Can this hierarchy/net be self-organized?

161



• Bio entity name recognition• Bio interaction extraction• Extraction of other relevant info• Information integration• And so on …

162


Bio Name Recognition

• Nomenclature loosely followed• Frequent use of conjunction and disjunction in

bio names with multiple bio-entity names sharing one head noun

• Long descriptive names• Names of genes and proteins used

interchangeably

163


• Inherent complexity of biological interactions

⇒ Sentences describing them also tend to be complicated

Bio-Interaction Extraction

164


Bio-Interaction Extraction

• Domain knowledge is often needed for interaction template filling

165


Extraction of Other Relevant Info

• Contextual information– Species, cell type, cellular localisation, etc

• Negative information

• Speculative & incomplete facts

166


Information Integration

• Bio-name mapping

• Bio-interaction mapping– how do you know two complex sentences are

talking about the same interaction?

167



• Hirschman et al., Accomplishments and challenges in literature data mining for biology. Bioinformatics, 18:1553-1561, 2002

• Ng & Tan, Challenges in biological literature mining for online discovery of molecular interaction pathways, IJCAT, 2006. (To appear)

• Zhou et al., Recognizing names in biomedical texts: a machine learning approach, Bioinformatics, 20:1178-1190, 2004

• Blaschke et al., Do you text? Bioinformatics, 21:4199-4200, 2005


Discovering Motif Pairs at Interaction Sites

from Protein Sequences on a

Proteome-Wide Scale

60 min

169


Proteins & Their Interactions

• 4 types of reps for proteins: primary, secondary, tertiary, & quaternay

• Protein interactions play impt role in inter cellular communication, in signal transduction, & in the regulation of gene expression

Courtesy of JE Wampler

170


Binding Sites

• Discovery of binding sites is a key part of understanding mechanisms of protein interactions

• Structure-based approaches– E.g., docking– Relatively accurate– Struct must be known

⇒ Sequence-based approaches

171


Typical Sequence-Based Approach

• Typical seq-based approaches have two steps:– Use pattern discovery algorithms to discover

domains and/or motifs of a group of proteins– Use domain-domain interaction discovery

methods (e.g., domain fusion) to discovery interacting domains

• Shortcomings:– Protein interaction information is not used by motif

discovery algorithms– Exact positions of binding sites often not

recognized

172


How about ...

• How about making use of known protein-protein bindings to guide the discovery of binding motifs?

173


Yeast SH3 domain-domainInteraction network: 394 edges, 206 nodes

Tong et al. Science, v295. 2002

8 proteins containing SH35 binding at least 6 of them

Protein Interaction Graphs

174


Bipartite Subgraphs

SH3 Proteins SH3-BindingProteins

The larger this group,the more likely theiractive sites will showup clearly in a multiplealignment?

175


Problem StatementGiven a PPI expt E, the problem is

(1) To find all pairs X, Y of interacting protein groups, so that

(1.1) X and have full mutual interactions(1.2) X and Y are as large as possible

&

(2) To identify “good” binding motif pairs from these pairs of interacting protein groups

176


PPI Expt As a Graph

• PPI expt E as undirected graph GE = ⟨VE, DE⟩, – where VE are the proteins and DE the

edges,– so that two proteins are connected in GE iff

there is a binding betw them in PPI expt E

• Let LE(p) denote neighborhood of protein p in GE

• Let LE(P) = ⎧⎫p∈P LE(p) denote the common neighborhood of all proteins in P in GE

177


Maximality

• Proposition 2.1Let E be a PPI expt. Let X,Y be a pair of protein groups so that X =

LE(Y) and Y = LE(X). Let X’,Y’ be another pair of protein groups so that

X’ = LE(Y’), Y’ = LE(X’), X’⊆ X, & Y’ ⊆ Y. Then X = X’ and Y = Y’.

⇒ In other words, if X = LE(Y) and Y = LE(X), then X,Y is a maximal pair of protein groups that have full interactions

178


Recasting to Graph Theory

• X, Y is a pair of interacting protein groups in PPI exptE iff X = LE(Y) and Y = LE(X) 1.1

1.2

179


Max Complete Bipartite Subgraph

• A graph H = ⟨V1∪V2, DH⟩ is a maximal complete bipartite subgraph of G iff– H is a subgraph of G, – V1 × V2 = DH, – V1 ∩ V2 = {}, &– There is no H’ = ⟨V’1∪V’2, DH’⟩ with V1 ⊂ V’1

& V2 ⊂ V’2 that has the same properties above

180


The Connection to Graph Theory

• X, Y is a pair of interacting protein groups in PPI exptE iff H = ⟨X ∪Y, X ×Y ⟩ is max complete bipartite subgraphof GE

• Let H = ⟨X ∪Y, DE| X ∪Y⟩ be subgraph of GE with X,Y a pair of interacting protein groups

⇒ X = LE(Y) and Y = LE(X) ⇒ Full interactions betw X & Y⇒ X × Y = DE| X ∪Y

• By excluding self-binding, we have X ∩ Y = {}

• By Prop 2.1, we have H is max

181


Therefore … But ...

• Therefore, to find pairs of interacting protein groups, we can use algorithms from graph theory for enumerating maximal complete bipartite subgraphs

• According to Eppstein 1994, this has complexity O(a322an), where a is the aboricity of the graph and n the number of vertices

• This is inefficient because a is often around 10-20 in practice

182


From PPI Expts To Transactions

• In PPI expt E, we obtain for each protein p, a list LE(p) of proteins that bind p– assume p∉LE(p), as such expts are not intended to detect self-binding

– assume q∈LE(p) implies p∈ LE(q), as binding is symmetric

• LE(p) can be thought of as a transaction & tE(p) as the “id” of this transaction

⇒ E can be thought of as generating a db of transactions DE = {tE(p1), …, tE(pk)}, where p1, …, pkare all the proteins involved in E

⇒ a set of proteins X can be thought of as a patternin DE if there is tE(p)∈ DE st X ⊆ LE(p)

183


Example

• Consider expt E with 5 proteins p1, …, p5, st p2and p3 bind every protein except themselves

• Then DE looks like this (as a matrix):

p1

p2

p3

p4

p5

184


Notations

• Let sE(d) denote the protein p st tE(p) = d⇒ sE(tE(p)) = p• Let [|p|]E denote the set {tE(q) | p∈LE(q)} of

transactions in DE in which p occurs⇒ tE(p)∈ [|q|]E implies tE(q)∈[|p|]E

• Let [|X|]E denote the set ⎧⎫p∈X [|p|]E of transactions in which the pattern X occurs

⇒ tE(Y)⊆[|X|]E implies tE(X)⊆[|Y|]E

• Let tE(X) denote the set {tE(p) | p∈ X} of transaction id’s, where X is a pattern in DE

• Let sE(T) denote the pattern {sE(d) | d∈T}⇒sE[|X|]E = ⎧⎫p∈X LE(p) = LE(X)

185


Closed Patterns

• Let [X]E = {Y | [|Y|]E = [|X|]E} denote the equivalence class of the pattern X in DE

• A pattern X is said to be a closed pattern of DE iff X = closedE(X), where {closedE(X)} = max [X]E

186


Key Proposition

• Proposition 3.2Let X be a closed pattern in DE.Then X = sE [|sE [|X|]E |]E

X

[|X|]

sE[|X|]E

[|sE[|X|]E|]E

187


Proof

188


Consequently...

• Corollary 3.4Let X and Y be closed pattern in DE.Then X = Y iff sE [|X|]E = sE [|Y|]E

• Proposition 3.5For any pattern X, we have X ∩ sE [|X|]E = {}

189


Implication

• Corollary 3.6Let E be a PPI expt.Let C be the set of closed patterns of DE.Then |C| is even

190


More Important Implication

• Theorem 3.7The bijective function sEo[|·|]E partitions the set of closed patterns of DE into bipartite graphs

p1

p2

p3

p4

p5

191


Even More Interesting...

• For each pair X and Y, – X is the largest group of

proteins that bind all the proteins in Y; and

– Y is the largest group of proteins that bind all the proteins in X

⇒ X,Y is a pair of interacting protein groups

192


A Couple More Propositions

• Proposition 3.3For any pattern X, sE[|X|]E is closed pattern in DE

• Corollary 3.7X is a closed pattern in DE iff X = sE[|sE[|X|]E|]E

“the common neighbours of the common neighbours of X

are X themselves.”

“each side of a protein interaction group is a

closed pattern”

“X is one side of a protein interaction

group”

Recall that sE[|X|]E = LE(X)

193


At Last!

• These are ALL the interacting protein groups

⇒ To mine these protein groups, it suffices to mine closed patterns in DE

194


An Extension

• Not all interacting protein groups X, Y are equally interesting– X and Y are both singleton, vs– X is a large group, Y is small group, vs– X is a large group, Y is a large group

⇒ Set “interestingness” threshold on X, Y st a pair of interacting protein groups X, Y is interesting only if |X| ≥ m and |Y| ≥ n

195


An Optimization

• Let X, Y be a pair of interacting protein groups– By Theorem 3.7, X = sE [|Y|]E and Y = sE [|X|]E

– By Definition of [|•|]E, |X| = times Y occurs in DE

– By Definition of [|•|]E, |Y| = times X occurs in DE

⇒ To mine interesting pairs X, Y of interacting protein groups in an expt E such that |X| ≥ m and |Y| ≥ n, it suffices to mine closed patterns X that appears ≥ n times in DE and |X| ≥ m

196


Closed Pattern Mining Algorithms

• CLOSET, Pei et al. 2000• CARPENTER, Pan et al. 2003• FPclose*, Grahne & Zhu 2003• GC-growth, Li et al. 2005• ...

⇒We have efficient algorithms for mining interesting interacting protein groups

197


Example Breitkreutz et al, Genome Biology, 4, R23, 2003X and sE[|X|]E both occur with freqat least that of support threshold

198


It now remains to generate the motif pairs …

199


Many Motif Discovery Methods

• MEME, Bailey & Elkan 1995

• CONSENSUS, Hertz & Stormo 1995

• PROTOMAT, Henikoff & Henikoff 1991

• CLUSTAL, Higgins & Sharp 1988

• …

• For illustration, we use PROTOMAT here

200


PROTOMAT• Core of Block Maker, a WWW server that return

blocks (ungapped multiple alignments) for any submitted set of protein sequences

• Comprises 2 steps:– MOTIF, Smith et al. 1990

• Look for spaced triplets in given set of proteins– MOTOMAT, Henikoff & Henikoff 1991

• Merge overlapping blocks produced by MOTIF• Extend blocks in both directions until similarity falls• Determine best set of blocks that are in the same order

and do not overlap

we treat every block, instead of whole set of blocks generated by PROTOMAT, as a binding motif

201


Example, Breitkreutz et al, Genome Biology, 4, R23, 2003

• Comprises 19038 genetic and physical interactions in yeast among 4907 proteins

• Look for interesting pairs with m = n = 5• About 1s to generate 60k closed patterns⇒ Too many for PROTOMAT. So consider only

maximal closed patterns, giving 7847 pairs• PROTOMAT produces 17256 left blocks and

19350 right blocks after 6 hours • Most groups yield 1 to 3 blocks• Ave length of blocks = 11.696, std dev = 5.45

202


Databases Used for Validation

• BLOCKS, Pietrokovski et al. 1996

• PRINTS, Attwood & Beck 1994

• Pfam, Sonnhammer et al. 1997

• InterDom, Ng et al. 2003

203


Validation for Single Motifs

• Compare all single motifs in our discovered motif pairs with all domains of specific domain databases – LAMA, Pietrokovski 1996– transform blocks into position-specific scoring matrices (PSSM)– run Smith-Waterman to align pairs of PSSM using Pearson

correlation coefficient to measure similarity betw 2 columns– a block is mapped to another block if 95% of positions in a block

occuring in the optimal alignment is common to another block and Z-score is > 5.6, where Z-score is the number std dev away from the mean generated by millions of shuffles of the BLOCKS database

• Determine number of motifs that can be mapped to these domains and the overall correlation in the portions that are mapped

204


Results for Single Motifs

• Our blocks map to 32% of blocks in BLOCKS and PRINTS, yet motifs from our blocks cover 72% of domains in BLOCKS and PRINTS

⇒ Maybe most domains in BLOCKS and PRINTS have less than half a block as binding motifs, or may not be related to binding behaviour

205


Validation for Motif Pairs

• Map our motif pairs into domain-domain interacting pairs

• Determine the number of overlaps between our motif pairs and those in the domain-domain interaction database

• Use InterDom as the domain-domain interaction database 30037

interactionsamong3535 domains

206


Linking Our Motif Pairs to InterDom

• InterDom represents domains by Pfam entries

⇒ To x-link, we have to– Map our motifs to blocks in BLOCKS and PRINTS– Link from BLOCKS and PRINTS to InterPro– Link from InterPro to Pfam– Match Pfam to InterDom

207


Results for Motif Pairs

Domain-domain interactionsinferred from protein complexesor from interactions betweensingle domain proteins

Both sidesmapped to BLOCKS

Both sidesmapped to PRINTS

One side mapped to PRINTS,one side mapped to BLOCKS

208


Example Confirmed Binding Motif

• 1 of the 241 binding motifs we found that can be confirmed using protein complexes is #1781...

As shown in the next slide, this pair corresponds to interactionsites between LSM domains. E.g., all 7 pairs of adjacent

LSM domains of pdb1mgq exhibits it.

209


Example: LSM Domainsof pdb1mgq

210


Sequence Identity Within a Group

• Ave seq identity within a group is 7.5%

⇒Group is unlikely to be detected by standard methods based on seqhomology

211


Conclusions

• Connection between maximal complete bipartite subgraphs and closed patterns

⇒ Closed pattern mining algorithms can be used to enumerate maximal complete bipartite subgraphsefficiently

• Connection between pairs of interacting protein groups and closed patterns

⇒ Discovery of binding motifs is accelerated because we need not execute expensive motif discovery algorithms on insignificant groups

212


References

• Haiquan Li, Jinyan Li, Limsoon Wong. Discovering Motif Pairs at Interaction Sites from Protein Sequences on a Proteom-Wide Scale. Bioinformatics, 22(8):989--996, 2006

• Jinyan Li, Haiquan Li, Donny Soh, Limsoon Wong. A Correspondence Between Maximal Complete Bipartite Subgraphs and Closed Patterns. Proceedings of 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 146--156, Porto, Portugal, October 2005

• Smith et al., Finding Sequence Motifs in Groups of Functionally Related Proteins. PNAS, 87:826-830, 1990

• Henikoff & Henikoff. Automated assembly of protein blocks for database searching. NAR, 19:6565-6572, 1991


Advancing Knowledge Discovery

10 min

214


Some of those “techniques”

frequently needed in analysis of

biomedical data are insufficiently

studied by current data mining researchers

• Recognizing what samples are relevant and what are not

• Recognizing what features are relevant and what are not & handling missing or incorrect values

• Recognizing trends, changes, and their causes

215


Going Beyond Frequent Patterns:Marrying Statisticians and Data Miners

• Statisticians use a battery of “interestingness”measures to decide if a feature/factor is relevant

• Examples:– Odds ratio– Relative risk– Gini index– Yule’s Q & Y– etc

• Odds ratio

−−

−

−=

,

,

,

,

),(

D

eD

dD

edD

PPPP

DPOR

216


Going Beyond Frequent Patterns:Experimenting With Tipping Events

• Given a data set, such as those related to human health, it is interesting to determine impt cohorts and impt factors causing transition betw cohorts

⇒ Tipping events⇒ Tipping factors are “action

items” for causing transitions

• “Tipping event” is two or more population cohorts that are significantly different from each other

• “Tipping factors” (TF) are small patterns whose presence or absence causes significant difference in population cohorts

• “Tipping base” (TB) is the pattern shared by the cohorts in a tipping event

• “Tipping point” (TP) is the combination of TB and a TF


Any Question?

218


Acknowledgements• Disease Treatment Planning & Understanding

– Jinyan Li, Huiqing Liu– Allen Yeoh, Richie Soong, Benjamin Mou, Anthony Tung

• Gene Feature Recognition– Huiqing Liu, Rajesh Chowdhary, Vladimir Bajic– Fanfan Zeng, Roland Yap, Liang Huai Yang, Yun Zheng,

Mong Li Lee, Wynne Hsu• Protein Function Inference & Binding Motifs

– Jinyan Li, Haiquan Li, See-Kiong Ng, Chris Tan, Donny Soh– Hon Nian Chua, Wing-Kin Sung, Jing Chen

• Text Mining from Biomedical Literature– See-Kiong Ng, Jian Su, Guodong Zhou– Chew Lim Tan

• Data Mining Theory & Technologies– Jinyan Li, Haiquan Li, Guimei Liu – Meng Ling Feng, Guozhu Dong

an introduction to knowledge discovery applications and...

Documents