EII PhD Winter School in Data Mining & Bioinformatics, Lorne, 13-17 August 2006 An Introduction to Knowledge Discovery Applications and Challenges in Life Sciences Limsoon Wong


EII PhD Winter School in Data Mining & Bioinformatics, Lorne, 13-17 August 2006

An Introduction to Knowledge Discovery Applications and Challenges in Life Sciences

Limsoon Wong


EII PhD Winter School in Data Mining & Bioinformatics, Lorne, 13-17 August 2006 Copyright 2006 © Limsoon Wong

Plan

• Quick introduction to knowledge discovery (13 slides, 20 min)

• Apps and challenges of knowledge discovery

– Protein function inference (20 slides, 40 min)

– Gene feature recognition (25 slides, 60 min)

– Protein subcellular localization prediction (20 slides, 40 min)

– Disease diagnosis, treatment, and understanding (40 slides, 80 min)

– Biomedical text mining (35 slides, 50 min)

– Discovery of protein binding motifs (40 slides, 60 min)

• Advancing knowledge discovery (3 slides, 10 min)


Quick Intro to Knowledge Discovery

20 min


What is Knowledge Discovery?

Whose block is this?

Jonathan’s blocks

Jessica’s blocks

Jonathan’s rules: Blue or Circle

Jessica’s rules: All the rest


What is Knowledge Discovery?

Question: Can you explain how?


Main Steps of Knowledge Discovery

• Training data gathering
• Feature generation
– k-grams, colour, texture, domain know-how, ...
• Feature selection
– Entropy, χ², CFS, t-test, domain know-how, ...
• Feature integration
– SVM, ANN, PCL, CART, C4.5, kNN, ... (some classifiers/methods)


An Abstract Model of a Classifier

• Given a test sample S
• Compute scores p(S), n(S)
• Predict S as negative if p(S) < t * n(S)
• Predict S as positive if p(S) ≥ t * n(S)

t is the decision threshold of the classifier
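
The decision rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `predict` and the example scores are made up here and do not come from the talk.

```python
def predict(p_score: float, n_score: float, t: float = 1.0) -> str:
    """Abstract classifier: positive iff p(S) >= t * n(S)."""
    return "positive" if p_score >= t * n_score else "negative"

# Raising the threshold t makes the classifier more reluctant
# to call a sample positive.
print(predict(5, 3, t=1.0))
print(predict(5, 3, t=2.0))
```

Sweeping t over a range of values trades sensitivity against specificity, which is presumably why the threshold is kept as an explicit parameter of the model.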


kNN (k=8)

[Figure: the neighbourhood of a test sample contains 5 samples of one class and 3 of the other. Image credit: Zaki]


kNN

• Given a sample S, find the k observations Si in the known data that are “closest” to it, and average their responses

• Assume S is well approximated by its neighbours

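$p(S) = \sum_{S_i \in N_k(S) \cap D^P} 1$

$n(S) = \sum_{S_i \in N_k(S) \cap D^N} 1$

where $N_k(S)$ is the neighbourhood of S defined by the k nearest samples to it, and $D^P$, $D^N$ are the positive and negative training samples.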

Assume distance between samples is Euclidean distance for now
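
As a rough sketch of the two kNN scores, the snippet below counts positive and negative training samples among the k nearest neighbours under Euclidean distance. The function name `knn_scores` and the toy points are assumptions introduced for illustration only.

```python
import math

def knn_scores(sample, positives, negatives, k):
    """Return (p(S), n(S)): counts of positive and negative samples
    among the k training samples nearest to `sample` (Euclidean)."""
    labelled = [(math.dist(sample, x), "P") for x in positives]
    labelled += [(math.dist(sample, x), "N") for x in negatives]
    nearest = sorted(labelled)[:k]
    p = sum(1 for _, label in nearest if label == "P")
    return p, len(nearest) - p

positives = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1)]   # toy positive samples
negatives = [(1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]   # toy negative samples
print(knn_scores((0.15, 0.15), positives, negatives, k=3))
```

Feeding the two counts into the abstract rule "positive iff p(S) ≥ t * n(S)" with t = 1 gives the usual majority vote.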


Curse of Dimensionality

• How much of each dimension is needed to cover a proportion r of total sample space?

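• Calculate by $e_p(r) = r^{1/p}$

• So, to cover 1% of a 15-D space, need about 74% of each dimension; to cover 10%, need about 86%!

[Chart: $e_p(r)$ on the vertical axis (0 to 1) for p = 3, 6, 9, 12, 15, with one series for r = 0.01 and one for r = 0.1]

Exercise: Why $e_p(r) = r^{1/p}$?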
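
One way to see the formula: a sub-cube with edge length e inside a unit cube in p dimensions has volume e^p, so covering a proportion r requires e^p = r, i.e. e = r^(1/p). The short sketch below tabulates this for the dimensions shown in the chart; the function name `edge_fraction` is an assumption made here.

```python
def edge_fraction(r: float, p: int) -> float:
    """Edge length of a sub-cube covering a fraction r of a unit p-cube."""
    return r ** (1.0 / p)

# Columns: p, edge needed for r = 0.01, edge needed for r = 0.1
for p in (3, 6, 9, 12, 15):
    print(p, round(edge_fraction(0.01, p), 2), round(edge_fraction(0.1, p), 2))
```

For p = 15 this gives roughly 0.74 at r = 0.01 and roughly 0.86 at r = 0.1.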


Consequence of the Curse

• Suppose the number of samples given to us in the total sample space is fixed

• Let the dimension increase

• Then the distance of the k nearest neighbours of any point increases

• Then the k nearest neighbours are less and less useful for prediction, and can confuse the k-NN classifier
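
The effect can be checked with a small simulation, sketched here under simple assumptions: points are drawn uniformly from the unit cube, the sample size is fixed at 100, and only the nearest neighbour (k = 1) is measured. The function name and parameters are illustrative.

```python
import math
import random

def mean_nn_distance(n_points: int, dim: int, seed: int = 0) -> float:
    """Average distance from each point to its nearest neighbour,
    for n_points drawn uniformly from the unit cube in `dim` dimensions."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    total = 0.0
    for i, a in enumerate(pts):
        total += min(math.dist(a, b) for j, b in enumerate(pts) if j != i)
    return total / n_points

# Same number of samples, increasing dimension.
for dim in (2, 10, 50):
    print(dim, round(mean_nn_distance(100, dim), 3))
```

With the number of samples held fixed, the average nearest-neighbour distance grows steadily as the dimension goes up.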


Tackling the Curse

• Given a sample space of p dimensions

• It is possible that some dimensions are irrelevant

• Need to find ways to separate those dimensions (aka features) that are relevant (aka signals) from those that are irrelevant (aka noise)


Signal Selection (Basic Idea)

• Choose a feature w/ low intra-class distance
• Choose a feature w/ high inter-class distance


Signal Selection (e.g., t-statistics)
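
The formula on this slide did not survive in the transcript. As a hedged stand-in, the sketch below uses one common form of the t-statistic (the unequal-variance, Welch-style version): the gap between the two class means, which reflects inter-class distance, divided by a standard error built from the within-class variances, which reflects intra-class distance. The feature values are made up.

```python
import math
from statistics import mean, variance

def t_statistic(xs, ys):
    """Welch-style t-statistic of one feature between two classes:
    difference of class means over the combined standard error."""
    se = math.sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    return abs(mean(xs) - mean(ys)) / se

# Toy values of one feature in class A and class B (illustrative only).
separating = t_statistic([1.0, 1.1, 0.9, 1.0], [3.0, 3.1, 2.9, 3.0])
overlapping = t_statistic([1.0, 3.0, 2.0, 0.5], [1.5, 2.5, 0.8, 2.9])
print(separating > overlapping)
```

A feature with tight classes and well separated means scores high; a feature whose classes overlap scores close to zero.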


Self-fulfilling Oracle

• Construct artificial dataset with 100 samples, each with 100,000 randomly generated features and randomly assigned class labels

• Select 20 features with the best t-statistics (or other methods)

• Evaluate accuracy by cross validation using only the 20 selected features

• The resultant estimated accuracy can be ~90%

• But the true accuracy should be 50%, as the data were derived randomly


What Went Wrong?

• The 20 features were selected from the whole dataset

• Information in the held-out testing samples has thus been “leaked” to the training process

• The correct way is to re-select the 20 features at each fold; better still, use a totally new set of samples for testing
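
The leak can be reproduced on a smaller scale. The sketch below is an illustration under several assumptions that are not in the talk: 30 samples and 5,000 random features instead of 100 and 100,000, 10 selected features instead of 20, a nearest-centroid classifier, and 5-fold cross validation. All function names are made up here.

```python
import math
import random

def t_score(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    vx = sum((x - mx) ** 2 for x in xs) / (len(xs) - 1)
    vy = sum((y - my) ** 2 for y in ys) / (len(ys) - 1)
    return abs(mx - my) / math.sqrt(vx / len(xs) + vy / len(ys))

def select(data, labels, rows, n_keep):
    """Indices of the n_keep features with the best t-score on `rows`."""
    pos = [i for i in rows if labels[i] == 1]
    neg = [i for i in rows if labels[i] == 0]
    scores = [(t_score([data[i][f] for i in pos], [data[i][f] for i in neg]), f)
              for f in range(len(data[0]))]
    return [f for _, f in sorted(scores, reverse=True)[:n_keep]]

def centroid_predict(data, labels, train, test_row, feats):
    """Nearest-centroid classifier restricted to the selected features."""
    def centroid(c):
        rows = [i for i in train if labels[i] == c]
        return [sum(data[i][f] for i in rows) / len(rows) for f in feats]
    x = [data[test_row][f] for f in feats]
    return min((0, 1), key=lambda c: math.dist(x, centroid(c)))

def cv_accuracy(data, labels, n_keep, folds, reselect):
    everything = list(range(len(data)))
    # Without re-selection, features are chosen once using ALL rows,
    # including the rows later held out for testing (the leak).
    leaked = None if reselect else select(data, labels, everything, n_keep)
    correct = 0
    for k in range(folds):
        test = everything[k::folds]
        train = [i for i in everything if i not in test]
        feats = select(data, labels, train, n_keep) if reselect else leaked
        correct += sum(centroid_predict(data, labels, train, i, feats) == labels[i]
                       for i in test)
    return correct / len(data)

rng = random.Random(0)
n_samples, n_features = 30, 5000            # scaled down from 100 x 100,000
data = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
labels = [i % 2 for i in range(n_samples)]
rng.shuffle(labels)                          # random class labels

leaky = cv_accuracy(data, labels, 10, 5, reselect=False)
honest = cv_accuracy(data, labels, 10, 5, reselect=True)
print("selection on whole dataset:", leaky)
print("re-selection at each fold: ", honest)
```

Selecting once on the whole dataset should report a much higher accuracy than re-selecting inside each fold, even though the data are pure noise and no classifier can truly beat chance.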


Protein Function Inference

40 min


Sequence Alignment

• Key aspect of seq comparison is seq alignment

• A seq alignment maximizes the number of positions that are in agreement in two sequences

[Figure: alignment of Sequence U with Sequence V, with match, mismatch, and indel positions labelled]
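
A standard way to compute such an alignment is dynamic programming (Needleman-Wunsch). The sketch below returns only the best global score. The scoring values (match +1, mismatch -1, indel -1) are assumptions chosen for illustration; the slide itself only speaks of maximizing positions in agreement.

```python
def global_alignment_score(u: str, v: str, match=1, mismatch=-1, indel=-1) -> int:
    """Best global alignment score of u and v (Needleman-Wunsch DP)."""
    prev = [j * indel for j in range(len(v) + 1)]
    for i in range(1, len(u) + 1):
        curr = [i * indel]
        for j in range(1, len(v) + 1):
            diag = prev[j - 1] + (match if u[i - 1] == v[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + indel, curr[j - 1] + indel))
        prev = curr
    return prev[-1]

print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("GATTACA", "GATACA"))
```

Only two rows of the table are kept in memory, which is enough when the score alone is needed; recovering the alignment itself would require keeping the full table for traceback.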


Sequence Alignment: Poor Example

• Poor seq alignment shows few matched positions

⇒ The two proteins are not likely to be homologous

No obvious match between Amicyanin and Ascorbate Oxidase


Sequence Alignment: Good Example

• Good alignment usually has clusters of extensive matched positions

⇒ The two proteins are likely to be homologous

Good match between Amicyanin and unknown M. loti protein


Multiple Alignment: An Example

• Multiple seq alignment maximizes number of positions in agreement across several seqs

• Seqs belonging to same “family” usually have more conserved positions in a multiple seq alignment

Conserved sites


Function Assignment to Protein Seq

• How do we attempt to assign a function to a new protein sequence?

SPSTNRKYPPLPVDKLEEEINRRMADDNKLFREEFNALPACPIQATCEAASKEENKEKNRYVNILPYDHSRVHLTPVEGVPDSDYINASFINGYQEKNKFIAAQGPKEETVNDFWRMIWEQNTATIVMVTNLKERKECKCAQYWPDQGCWTYGNVRVSVEDVTVLVDYTVRKFCIQQVGDVTNRKPQRLITQFHFTSWPDFGVPFTPIGMLKFLKKVKACNPQYAGAIVVHCSAGVGRTGTFVVIDAMLDMMHSERKVDVYGFVSRIRAQRCQMVQTDMQYVFIYQALLEHYLYGDTELEVT


Guilt-by-Association

• Compare T with seqs of known function in a db
• Assign to T same function as homologs
• Confirm with suitable wet experiments
• Discard this function as a candidate


BLAST: How It Works

Altschul et al., JMB, 215:403-410, 1990

• BLAST is one of the most popular tools for doing “guilt-by-association” sequence homology search
• Find from db seqs with short perfect matches to query seq
• Find seqs with good flanking alignment
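
The two steps can be illustrated with a toy seed-and-extend routine. This is a deliberately simplified sketch and not BLAST itself: it uses exact k-mer seeds and extends only while letters match exactly, whereas the real program scores with substitution matrices and statistical cut-offs. The function name and test strings are made up.

```python
def seed_and_extend(query: str, subject: str, k: int = 3):
    """Toy seed-and-extend: find exact k-mer matches, then extend each
    one left and right without gaps while the letters keep matching.
    Returns the longest (query_start, subject_start, length) found."""
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    best = None
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            lo_q, lo_s = i, j
            while lo_q > 0 and lo_s > 0 and query[lo_q - 1] == subject[lo_s - 1]:
                lo_q, lo_s = lo_q - 1, lo_s - 1
            hi_q, hi_s = i + k, j + k
            while (hi_q < len(query) and hi_s < len(subject)
                   and query[hi_q] == subject[hi_s]):
                hi_q, hi_s = hi_q + 1, hi_s + 1
            hit = (lo_q, lo_s, hi_q - lo_q)
            if best is None or hit[2] > best[2]:
                best = hit
    return best

print(seed_and_extend("PQRSTVW", "AAQRSTVKK"))
```

Indexing the k-mers of the database side first is what makes the seeding step fast: each query word is looked up in a table instead of being compared against every position.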


Homologs obtained by BLAST

• Thus our example sequence could be a protein tyrosine phosphatase α (PTPα)


Example Alignment with PTPα


Guilt-by-Association: Caveats

• Ensure that the effects of database size and composition have been accounted for

• Ensure that the function of the homolog is not derived via invalid “transitive assignment”

• Ensure that the target sequence has all the key features associated with the function, e.g., active site and/or domain


Law of Large Numbers

• Suppose you are in a room with 365 other people

• Q: What is the prob that a specific person in the room has the same birthday as you?

• A: 1/365 = 0.3%

• Q: What is the prob that there is a person in the room having the same birthday as you?

• A: $1 - (364/365)^{365} \approx 63\%$

• Q: What is the prob that there are two persons in the room having the same birthday?

• A: 100%
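
The three answers can be checked numerically. The sketch assumes 365 equally likely birthdays and ignores leap days; the helper name `p_shared_pair` is introduced here for illustration.

```python
days = 365
others = 365                                   # other people in the room

p_specific = 1 / days                          # one named person matches you
p_anyone = 1 - ((days - 1) / days) ** others   # at least one of the others matches you

def p_shared_pair(n_people: int, days: int = 365) -> float:
    """Probability that some two of n_people share a birthday."""
    p_distinct = 1.0
    for i in range(n_people):
        p_distinct *= max(days - i, 0) / days
    return 1 - p_distinct

print(f"{p_specific:.1%}  {p_anyone:.1%}  {p_shared_pair(others + 1):.0%}")
```

With 366 people and only 365 possible birthdays, the last probability is exactly 1 by the pigeonhole principle: the product of "all distinct" factors hits zero.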


Interpretation of P-value

• Seq. comparison progs, e.g. BLAST, often associate a P-value to each hit

• P-value is interpreted as prob that a random seq has an equally good alignment

• Suppose the P-value of an alignment is $10^{-6}$

• If database has $10^7$ seqs, then you expect $10^7 \times 10^{-6} = 10$ seqs in it that give an equally good alignment

⇒ Need to correct for database size if your seq comparison prog does not do that!
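
The arithmetic on this slide, as a small sketch. The second function additionally assumes that database seqs behave as independent random draws, which is a simplification made here and not a statement from the talk.

```python
def expected_hits(p_value: float, db_size: int) -> float:
    """Expected number of database seqs scoring at least as well by chance."""
    return p_value * db_size

def p_at_least_one(p_value: float, db_size: int) -> float:
    """Chance that at least one random database seq scores as well,
    assuming independent seqs."""
    return 1 - (1 - p_value) ** db_size

print(expected_hits(1e-6, 10**7))
print(round(p_at_least_one(1e-6, 10**7), 5))
```

A per-comparison P-value that looks tiny can therefore still correspond to a near-certain chance hit once it is multiplied up by the size of the database.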


Lightning Does Strike Twice!

• Roy Sullivan, a former park ranger from Virginia, was struck by lightning 7 times
– 1942 (lost big-toe nail)
– 1969 (lost eyebrows)
– 1970 (left shoulder seared)
– 1972 (hair set on fire)
– 1973 (hair set on fire & legs seared)
– 1976 (ankle injured)
– 1977 (chest & stomach burned)
• September 1983, he committed suicide

Cartoon: Ron Hipschman

Data: David Hand


Effect of Seq Compositional Bias

• One fourth of all residues in protein seqs occur in regions with biased amino acid composition

• Alignments of two such regions achieve high scores purely due to segment composition

• While it is worth noting that two proteins contain similar low complexity regions, they are best excluded when constructing alignments

• BLAST employs the SEG algorithm to filter low complexity regions from proteins before executing a search

Source: NCBI


Effect of Sequence Length

Source: Abagyan & Batalov

Distribution of seq identity vs length of unrelated proteins


Important Unsolved Challenges

• What if there is no useful sequence homolog?
• Guilt by other types of association!
– Domain modeling (e.g., HMMPFAM)
– Similarity of dissimilarities (e.g., SVM-PAIRWISE)
– Similarity of phylogenetic profiles
– Similarity of subcellular co-localization & other physico-chemico properties (e.g., PROTFUN)
– Similarity of gene expression profiles
– Similarity of protein-protein interaction partners
– …
– Fusion of multiple types of info


Fusion of Multiple Types of Info

Model each data source into a graph

Combine graphs from diff sources to form larger graph


Recommended Readings

• Wong, The Practical Bioinformatician, 2004, ICP. Chapters 10 & 19

• Chua et al., Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics, 2006. (To appear)

• Bateman et al., The Pfam protein families database. NAR, 32:D138-D141, 2004

• Jaakkola et al., A discriminative framework for detecting remote protein homologies. JCB, 7:95-114, 2000

• Jensen et al., Prediction of human protein function from post-translational modifications and localization features. JMB, 319:1257-1265, 2002

• Pellegrini et al., Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. PNAS, 96:4285-4288, 1999


Gene Feature Recognition

60 min


Central Dogma

DNA: ...AATGGTACCGATGACCTG...

RNA: ...AAUGGUACCGAUGACCUGGAGC...

Protein: ...TRLRPLLALLALWP...


Translation Initiation Site


A Sample cDNA

• What makes the second ATG the TIS?

299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................ 80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE


Approach

• Training data gathering
• Signal generation
– k-grams, distance, domain know-how, ...
• Signal selection
– Entropy, χ², CFS, t-test, domain know-how, ...
• Signal integration
– SVM, ANN, PCL, CART, C4.5, kNN, ...


Training & Testing Data

• Vertebrate dataset of Pedersen & Nielsen [ISMB’97]

• 3312 sequences
• 13503 ATG sites
• 3312 (24.5%) are TIS
• 10191 (75.5%) are non-TIS
• Use for 3-fold x-validation expts


Signal Generation

• K-grams (i.e., k consecutive letters)
– K = 1, 2, 3, 4, 5, …
– Window size vs. fixed position
– Upstream, downstream vs. anywhere in window
– In-frame vs. any frame

[Chart: counts (vertical axis, 0 to 3) of the 1-grams A, C, G, T in seq1, seq2, seq3]


Signal Generation: An Example

• Window = ±100 bases
• In-frame, downstream
– GCT = 1, TTT = 1, ATG = 1, …
• Any-frame, downstream
– GCT = 3, TTT = 2, ATG = 2, …
• In-frame, upstream
– GCT = 2, TTT = 0, ATG = 0, ...

299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
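
A sketch of how such counts might be produced for one candidate ATG. The conventions here are assumptions, not the authors' exact definitions: a k-gram is assigned to the side on which it starts, the candidate ATG itself is skipped, and "in-frame" means starting a multiple of k bases away from the candidate. The toy sequence is made up, so its counts are not meant to reproduce the figures on the slide.

```python
from collections import Counter

def kgram_counts(seq: str, atg_pos: int, k: int = 3, window: int = 100):
    """Count k-grams around the candidate ATG at 0-based index atg_pos.
    Returns counters keyed by (frame, side), with frame in {"in", "any"}
    and side in {"up", "down"}."""
    counts = {(f, s): Counter() for f in ("in", "any") for s in ("up", "down")}
    start = max(0, atg_pos - window)
    end = min(len(seq), atg_pos + k + window)
    for i in range(start, end - k + 1):
        if i == atg_pos:
            continue                       # skip the candidate ATG itself
        side = "up" if i < atg_pos else "down"
        gram = seq[i:i + k]
        counts[("any", side)][gram] += 1
        if (i - atg_pos) % k == 0:         # same reading frame as the ATG
            counts[("in", side)][gram] += 1
    return counts

toy = "GCTGCTATGGCTTTTATG"                 # made-up sequence, ATG at index 6
c = kgram_counts(toy, atg_pos=6)
print(c[("in", "up")]["GCT"], c[("in", "down")]["GCT"], c[("any", "down")]["TTT"])
```

Each (frame, side, k-gram) combination then becomes one numeric feature of the candidate ATG.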


Too Many Signals

• For each value of k, there are $4^k \times 3 \times 2$ k-grams

• If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!

• This is too many for most machine learning algorithms
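
The feature count can be verified directly. Reading the factor 3 as the three region options (upstream, downstream, anywhere in window) and the factor 2 as the two frame options (in-frame, any frame) is an interpretation based on the signal-generation slide, not something stated outright.

```python
def n_kgram_features(k: int) -> int:
    """4^k possible k-grams x 3 region options x 2 frame options."""
    return 4 ** k * 3 * 2

per_k = [n_kgram_features(k) for k in range(1, 6)]
print(per_k, sum(per_k))
```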


Signal Selection (Basic Idea)

• Choose a signal w/ low intra-class distance
• Choose a signal w/ high inter-class distance


Signal Selection (e.g., CFS)

• Instead of scoring individual signals, how about scoring a group of signals as a whole?

• CFS
– Correlation-based Feature Selection
– A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
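
The slide gives no formula. A commonly cited form of the CFS group score (the merit heuristic from Hall's work on CFS) is sketched below as an assumption: with k signals, mean signal-class correlation r_cf and mean signal-signal correlation r_ff, the merit is k·r_cf / sqrt(k + k(k−1)·r_ff). The parameter names are illustrative.

```python
import math

def cfs_merit(k: int, mean_signal_class_corr: float,
              mean_signal_signal_corr: float) -> float:
    """Merit of a group of k signals:
    k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    return (k * mean_signal_class_corr) / math.sqrt(
        k + k * (k - 1) * mean_signal_signal_corr)

# Same relevance to the class, different redundancy within the group.
print(cfs_merit(5, 0.4, 0.1) > cfs_merit(5, 0.4, 0.9))
```

The numerator rewards correlation with the class; the denominator penalises groups whose members are redundant with each other, which matches the description on the slide.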


Sample K-grams Selected by CFS

• Position –3 (Kozak consensus)
• In-frame upstream ATG (leaky scanning)
• In-frame downstream
– TAA, TAG, TGA (stop codon)
– CTG, GAC, GAG, and GCC (codon bias?)


Signal Integration

• kNN
– Given a test sample, find the k training samples that are most similar to it. Let the majority class win
• SVM
– Given a group of training samples from two classes, determine a separating plane that maximises the margin between the classes

• Naïve Bayes, ANN, C4.5, ...


Results (3-fold x-validation)

Classifier       Sensitivity    Specificity    Precision      Accuracy
                 TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)
Naïve Bayes      84.3%          86.1%          66.3%          85.7%
SVM              73.9%          93.2%          77.9%          88.5%
Neural Network   77.6%          93.2%          78.8%          89.4%
Decision Tree    74.0%          94.4%          81.1%          89.4%
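
The four columns are tied together by the class sizes. As a sanity check, the sketch below rebuilds approximate confusion counts for the Naïve Bayes row from its first two columns and the dataset sizes given earlier (3312 TIS, 10191 non-TIS), then recomputes the other columns. It assumes the rates are pooled over the whole dataset, which the slide does not state.

```python
def metrics(tp: float, fn: float, tn: float, fp: float) -> dict:
    return {
        "sensitivity": tp / (tp + fn),          # TP/(TP + FN)
        "specificity": tn / (tn + fp),          # TN/(TN + FP)
        "precision": tp / (tp + fp),            # TP/(TP + FP)
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# Counts implied by the Naive Bayes row: 3312 TIS, 10191 non-TIS.
tp = 0.843 * 3312
tn = 0.861 * 10191
m = metrics(tp, 3312 - tp, tn, 10191 - tn)
print({name: round(value, 3) for name, value in m.items()})
```

The recomputed precision and accuracy come out close to the 66.3% and 85.7% reported in the table, which also shows how a 3:1 class imbalance drags precision well below sensitivity.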


Improvement by Scanning

• Apply Naïve Bayes or SVM left-to-right until first ATG predicted as positive. That’s the TIS

• Naïve Bayes & SVM models were trained using TIS vs. Up-stream ATG

Classifier       Sensitivity    Specificity    Precision      Accuracy
                 TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)
NB               84.3%          86.1%          66.3%          85.7%
SVM              73.9%          93.2%          77.9%          88.5%
NB+Scanning      87.3%          96.1%          87.9%          93.9%
SVM+Scanning     88.5%          96.3%          88.6%          94.4%
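
The scanning rule itself is simple to sketch. In the snippet, `is_tis` is a placeholder for a trained Naïve Bayes or SVM model; the `toy_model` used to exercise it is a hypothetical stand-in invented here (it accepts an ATG with a purine three bases upstream) and is not the model from the talk.

```python
def scan_for_tis(seq: str, is_tis):
    """Return the index of the first ATG that `is_tis` accepts, else None.
    `is_tis(seq, pos)` stands in for a trained classifier."""
    pos = seq.find("ATG")
    while pos != -1:
        if is_tis(seq, pos):
            return pos
        pos = seq.find("ATG", pos + 1)
    return None

# Hypothetical stand-in classifier, for illustration only.
def toy_model(seq, pos):
    return pos >= 3 and seq[pos - 3] in "AG"

print(scan_for_tis("CCTATGCCACCATGGCT", toy_model))
```

Because the scan stops at the first accepted ATG, every ATG downstream of it is implicitly predicted negative, which fits the large gains in specificity and precision shown in the table.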


mRNA→Protein

[Figure: genetic code chart mapping codons to the amino acids F, L, I, M, V, S, P, T, A, Y, H, Q, N, K, D, E, C, W, R, G and to stop]

How about using k-grams from the translation?
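
A minimal sketch of the idea: translate the sequence downstream of a candidate ATG with the standard genetic code, then count amino-acid k-grams. The function names are assumptions, and the input is the first ten codons starting at the second ATG of the sample cDNA shown earlier.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(cdna: str) -> str:
    """Translate in-frame codons from index 0; '*' marks a stop codon."""
    usable = len(cdna) - len(cdna) % 3
    return "".join(CODON_TABLE[cdna[i:i + 3]] for i in range(0, usable, 3))

def amino_kgrams(protein: str, k: int) -> Counter:
    return Counter(protein[i:i + k] for i in range(len(protein) - k + 1))

protein = translate("ATGGCTTTTGGCTGTCAGGGCAGCTGTAGG")
print(protein, amino_kgrams(protein, 2).most_common(3))
```

Working on amino acids shrinks the alphabet from 64 codons to 20 residues plus stop, so synonymous codons collapse into a single feature.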


Amino-Acid Features



Amino Acid K-grams Discovered (by Entropy)


Independent Validation Sets

• A. Hatzigeorgiou:
– 480 fully sequenced human cDNAs
– 188 left after eliminating sequences similar to training set (Pedersen & Nielsen’s)
– 3.42% of ATGs are TIS
• Our own:
– Well characterized human gene sequences from chromosome X (565 TIS) and chromosome 21 (180 TIS)


Validation Results (on Hatzigeorgiou’s)

– Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s dataset


Validation Results (on Chr X and Chr 21)

• Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s dataset

[Figure: results of ATGpr vs. our method]


Important Unsolved Challenges

• Other gene features: TSS, PolyA, Splicing, …

• Gene regulation: TFBS, EREs, miRNA targets, …

• “Dark matters”: miRNA & non-coding genes


TSS: Is State of the Art Good Enough?


TFs and TFBSs

• Recognize TFBS
– Short & degenerate
– Incomplete, erroneous data

• Characterize type of regulation, e.g., suppressor vs enhancer

• Identify synergistic TF pairs
– Genes that are coordinately bound are coordinately expressed
– Requires data over many time points
– TFs may interact only under certain conditions

• A method to identify synergistic TF pairs in cell cycle
– Study if a pair of TFs are assoc with the same genes more often than random
– For each significant pair, test whether there is a phase in cell cycle where the common associated genes are significantly diff from the other cell cycle phases

⇒ Quite primitive. Lots more work needed!
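The first step of the method above (are two TFs associated with the same genes more often than random?) can be sketched as a hypergeometric tail test. This is only an illustration under stated assumptions: the function name `overlap_pvalue`, the gene identifiers, and the toy target sets are hypothetical, and the statistic used in the cited work may differ.

```python
from math import comb

def overlap_pvalue(n_genes, targets_a, targets_b):
    """P(overlap >= observed) if targets_b were a random draw from n_genes genes."""
    a, b = len(targets_a), len(targets_b)
    k = len(targets_a & targets_b)  # observed number of shared target genes
    total = comb(n_genes, b)
    # Sum hypergeometric probabilities of every overlap at least as large as observed.
    tail = sum(comb(a, i) * comb(n_genes - a, b - i)
               for i in range(k, min(a, b) + 1))
    return tail / total

# Hypothetical toy data: 20 genes, two TFs with 5 targets each, 4 of them shared.
tf1 = {"g1", "g2", "g3", "g4", "g5"}
tf2 = {"g2", "g3", "g4", "g5", "g6"}
p = overlap_pvalue(20, tf1, tf2)
```

A small `p` suggests the pair shares targets more often than chance; the second step (phase-specific behaviour of the shared genes) would need expression data over many time points, as the slide notes.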


miRNA: New Frontier in Molecular Biology

• Impt functions of miRNA:
– Control of cell fate and fat metabolism in flies
– Neuronal patterning in nematodes
– Modulation of hematopoietic lineage differentiation in mammals
– Control of leaf and flower development in plants

• By regulating gene expr in binding targets:
– Repression/inhibition of translation
– Degradation of mRNA

• miRNA identification thru direct cloning has limitations:
– miRNAs may be expressed only in certain tissues or at certain times
– Expr levels vary greatly
– Degradation products from mRNAs and other endogenous non-coding RNAs coexist w/ miRNAs and are sometimes dominant in small RNA molecule samples extracted from cells


Recommended Readings

• Wong, The Practical Bioinformatician, 2004, ICP. Chapters 4, 5, 6, & 7

• Liu & Wong, Data mining tools for biological sequences. JBCB, 1:139-168, 2003

• Tsai et al., Statistical methods for identifying yeast cell cycle transcription factors. PNAS, 102:13532-13537, 2005

• Yang et al., Identification of microRNA precursors via SVM. APBC 2006, pages 267-276


Protein Subcellular Localization Prediction

40 min


Compartments and Sorting

• Eukaryotic cells require proteins to be targeted to their subcellular destinations

• Protein sorting is determined by specific amino acid sequences, or “signals”, within the protein

• The secretory pathway targets proteins to the plasma membrane and to some membrane-bound organelles such as lysosomes, or exports proteins from the cell


Datasets

• Reinhardt & Hubbard, NAR, 26:2230--2236, 1998
– 2427 eukaryotic proteins for 4 locations (cytoplasmic, extracellular, nuclear, & mitochondrial)
– 997 prokaryotic proteins for 3 locations (cytoplasmic, extracellular, & periplasmic)

• Park & Kanehisa, Bioinformatics, 19:1656--1663, 2003
– 7589 eukaryotic proteins from 709 organisms for 12 locations (chloroplast, cytoplasmic, cytoskeleton, ER, extracellular, golgi, lysosomal, mitochondrial, nuclear, peroxisomal, plasma membrane, vacuolar)

• Chou & Cai, JBC, 277:45765--45769, 2002
– 2191 proteins for 12 locations

• Emanuelsson et al., JMB, 300:1005--1016, 2000

• Gardy et al., NAR, 31:3613--3617, 2003


Common Eukaryotic Protein Sorting Signals

For a comprehensive list of cellular localization sites, see http://mendel.imp.univie.ac.at/CELL_LOC/index.html


Schematic View of Sorting Signals

[Figure: schematic of sorting signals; labels: cleavage site, ~25aa]


Sequence Logos of SP, mTP, & cTP

[Figure: sequence logos]

• SP: signal peptide

• mTP: mitochondrial targeting peptide

• cTP: chloroplast transit peptide


Expert System Approach: PSORT
Horton & Nakai, ISMB, 1997

[Figure: a simplified version of the decision tree that PSORT uses to check and reason over various sorting signals]


Amino acid compositions of proteins residing in different sites are different


Amino Acid Composition Differences

• Each cellular location has its own characteristic physico-chemical environment

• Proteins in each location have adapted thru evolution to that environment

• This adaptation is thus reflected in the protein structure and amino acid composition

• If the above is true, the amino acid composition differences wrt cellular location sites should be more pronounced on protein surfaces than in protein interiors

• Exercise: Why?


Adaptation of Protein Surfaces
Andrade et al., JMB, 1998

• To test the theory of adaptation of protein surfaces to subcellular localization, we plot 3 types of composition vectors along their first two principal components

[Formula: composition vector, whose entries are the proportion of the jth amino acid type in the ith protein]


Adaptation of Protein Surfaces Andrade et al., JMB, 1998

• Clearly, total & surface composition vectors show better separation than interior composition vectors

[Figure: three principal-component plots (total, interior, and surface amino acid composition vectors); legend: nuclear, cyto, extracell]


Amino Acid Composition

• This means we can use amino acid composition vectors, especially those from protein surfaces, to predict subcellular localization!

• Let’s see how this turns out…
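As a concrete picture of the feature being discussed, here is a minimal sketch of a total amino acid composition vector: one fraction per standard residue type. The function name `composition`, the residue ordering, and the example sequence are illustrative assumptions, and residues outside the 20 standard letters are simply ignored.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def composition(seq):
    """Return a 20-dimensional vector of residue fractions for a protein sequence."""
    seq = seq.upper()
    counts = [seq.count(a) for a in AMINO_ACIDS]
    total = sum(counts)  # non-standard letters are not counted
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]

# Hypothetical toy sequence, not a real protein.
vec = composition("MKKLLA")
```

A surface or interior composition vector would be computed the same way, restricted to residues labelled as exposed or buried; that labelling needs structural data and is not shown here.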


Neural Networks: NNPSL
Reinhardt & Hubbard, NAR, 26:2230--2236, 1998

[Figure: neural network]

• Inputs (Input 1, …, Input 20): fraction of each amino acid in the input protein

• Outputs: cytoplasmic, extracellular, mitochondrial, nuclear


NNPSL: Performance

• Outputs of NNPSL have values 0 to 1. The difference (Δ) between the highest and the next highest nodes can be used as a reliability index

[Figure: performance by reliability bin: 0 < Δ < 0.2, 0.2 < Δ < 0.4, 0.4 < Δ < 0.6, 0.6 < Δ < 0.8, 0.8 < Δ < 1]

Dataset: Reinhardt & Hubbard, NAR, 1998
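The reliability index Δ is simple enough to write down directly. The sketch below assumes the network outputs are available as a label-to-value mapping; the function name and the output values are made up for illustration and are not NNPSL results.

```python
def reliability_index(outputs):
    """Return (predicted label, delta), where delta = top output - runner-up output."""
    ranked = sorted(outputs.items(), key=lambda kv: kv[1], reverse=True)
    (best_label, best), (_, second) = ranked[0], ranked[1]
    return best_label, best - second

# Hypothetical output-node values in [0, 1].
label, delta = reliability_index({
    "cytoplasmic": 0.15,
    "extracellular": 0.05,
    "mitochondrial": 0.10,
    "nuclear": 0.85,
})
```

Here the prediction falls in the 0.6 < Δ < 0.8 bin; predictions with small Δ are the ones to treat with caution.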


Support Vector Machines: SubLoc
Hua & Sun, Bioinformatics, 17:721--728, 2001

[Figure: a 20-dimensional vector giving the amino acid composition of the input protein is fed to four SVMs (extracellular vs rest, nuclear vs rest, cytoplasmic vs rest, mitochondrial vs rest); the prediction is the argmax over X of the X-vs-rest outputs]

The SVMs use

• polynomial kernel with d = 9 (prokaryotic), $K(X_i, X_j) = (X_i \cdot X_j + 1)^d$

• RBF kernel with γ = 16 (eukaryotic), $K(X_i, X_j) = \exp(-\gamma |X_i - X_j|^2)$
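The two kernels and the one-vs-rest decision on this slide can be sketched in a few lines. This is not the SubLoc implementation: the function names are illustrative, and the decision values passed to `predict_one_vs_rest` are assumed to come from already-trained X-vs-rest SVMs.

```python
import math

def poly_kernel(x, y, d=9):
    """Polynomial kernel (x . y + 1)^d; d = 9 is the slide's prokaryotic setting."""
    return (sum(a * b for a, b in zip(x, y)) + 1) ** d

def rbf_kernel(x, y, gamma=16.0):
    """RBF kernel exp(-gamma * |x - y|^2); gamma = 16 is the slide's eukaryotic setting."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def predict_one_vs_rest(scores):
    """scores maps each location X to the decision value of its X-vs-rest SVM."""
    return max(scores, key=scores.get)

# Hypothetical decision values for one protein.
location = predict_one_vs_rest(
    {"extracellular": -1.2, "nuclear": 0.8, "cytoplasmic": 0.3, "mitochondrial": -0.5}
)
```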


SubLoc: Performance

[Table: NNPSL vs. SubLoc (Eukaryotic)]

Dataset: Reinhardt & Hubbard, NAR, 1998


SubLoc: Robustness of Amino Acid Composition Approach

• Amazingly, accuracy of SubLoc is virtually unaffected when the first 10, 20, 30, & 40 amino acids in a protein are deleted

• Amino acid composition is a robust indicator of subcellular localization, and is insensitive to errors in N-terminal sequences


Amino Acid Composition: Taking it Further

• How about pairs of consecutive amino acids (a.k.a. 2-grams)? How about 3-grams, …, k-grams?

• How about pseudo amino acid composition?

• How about presence of entire functional domains? (I.e., think of the presence/absence of a functional domain as a summary of amino acid sequence info...)
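The k-gram idea in the first bullet generalizes single-residue composition to overlapping windows of length k. A minimal sketch, with a hypothetical function name and toy sequence; a sparse dictionary is used because the full k-gram space has 20^k entries.

```python
from collections import Counter

def kgram_composition(seq, k=2):
    """Fraction of each overlapping k-gram among all k-grams in the sequence."""
    seq = seq.upper()
    grams = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not grams:
        raise ValueError("sequence is shorter than k")
    counts = Counter(grams)
    return {g: c / len(grams) for g, c in counts.items()}

# Hypothetical toy sequence: its 2-grams are MK, KK, KL, LK, KK.
freqs = kgram_composition("MKKLKK", k=2)
```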


Functional Domain Composition
Cai & Chou, BBRC, 305:407--411, 2003

[Figure: workflow]

• Training seqs of various localization sites: BLAST against db of known functional domains (Interpro)

• NN-5875D: Train k-NN (k=1) using these vectors

• Or, if no hit found: pseudo amino acid composition + amino acid composition

• NN-40D: Train k-NN (k=1) using these vectors

If a protein got a hit in Interpro, use NN-5875D; else use NN-40D
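The "use NN-5875D if there is a domain hit, else NN-40D" rule is a 1-nearest-neighbour classifier with a fallback feature space. The sketch below shows only that control flow, under loud assumptions: the vectors are tiny stand-ins for the 5875-D and 40-D vectors, squared Euclidean distance is assumed as the similarity measure, and all names and data are hypothetical.

```python
def nearest_label(query, training):
    """1-NN: label of the training vector closest to query (squared Euclidean)."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(training, key=lambda item: sqdist(query, item[0]))[1]

def predict(domain_vec, composition_vec, domain_training, composition_training):
    """Use the domain-based classifier when any domain is present, else fall back."""
    if any(domain_vec):  # at least one functional-domain hit
        return nearest_label(domain_vec, domain_training)
    return nearest_label(composition_vec, composition_training)

# Hypothetical 3-D "domain" vectors and 2-D "composition" vectors.
domain_training = [([1, 0, 0], "nuclear"), ([0, 1, 1], "cytoplasmic")]
composition_training = [([0.6, 0.4], "nuclear"), ([0.1, 0.9], "extracellular")]

with_hit = predict([0, 1, 0], [0.5, 0.5], domain_training, composition_training)
no_hit = predict([0, 0, 0], [0.2, 0.8], domain_training, composition_training)
```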


Functional Domain Composition: Performance

Dataset: Reinhardt & Hubbard, NAR, 1998


Recommended Readings

• Horton & Nakai, “Better prediction of protein cellular localization sites with the k-nearest neighbours classifier”, ISMB, 5:147--152, 1997

• Gardy et al., “PSORT-B: Improving protein subcellular localization for Gram-negative bacteria”, NAR, 31:3613--3617, 2003

• Emanuelsson, “Predicting protein subcellular localization from amino acid sequence information”, BIB, 3:361--376, 2002

• Andrade et al., “Adaptation of protein surfaces to subcellular location”, JMB, 276:517--525, 1998

• Yuan, “Prediction of protein subcellular locations using Markov chain models”, FEBS Letters, 451:23--26, 1999

• Hua & Sun, “Support vector machine approach for protein subcellular localization prediction”, Bioinformatics, 17:721--728, 2001

• Park & Kanehisa, “Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs”, Bioinformatics, 19:1656--1663, 2003


Disease Diagnosis, Treatment, & Understanding

80 min


Childhood Acute Lymphoblastic Leukemia

• Major subtypes: T-ALL, E2A-PBX, TEL-AML, BCR-ABL, MLL genome rearrangements, Hyperdiploid>50

• Diff subtypes respond differently to same Tx

• Over-intensive Tx
– Development of secondary cancers
– Reduction of IQ

• Under-intensive Tx
– Relapse

• The subtypes look similar

• Conventional diagnosis
– Immunophenotyping
– Cytogenetics
– Molecular diagnostics


Childhood ALL Cure Rates

• Conventional risk assignment procedure requires difficult, expensive tests and collective judgement of multiple specialists

⇒ Not available in less advanced ASEAN countries

⇒ Can we have a single-test easy-to-use platform instead?

[Figure: bar chart of cure rate (axis 0% to 80%) for Singapore, Malaysia, Indonesia, Philippines, Thailand, Vietnam, Cambodia]


Single-Test Easy-to-Use Platform


A Sample Affymetrix GeneChip Data File (U95A)


Overall Strategy

• For each subtype, select genes to develop classification model for diagnosing that subtype

• For each subtype, select genes to develop prediction model for prognosis of that subtype

[Figure: Diagnosis of subtype ⇒ Subtype-dependent prognosis ⇒ Risk-stratified treatment intensity]


Subtype Diagnosis by PCL

• Gene expression data collection

• Gene selection by χ2

• Classifier training by emerging pattern

• Classifier tuning (optional for some machine learning methods)

• Apply classifier for diagnosis of future cases by PCL


Childhood ALL Subtype Diagnosis Workflow

A tree-structured diagnostic workflow was recommended by our doctor collaborator


Training and Testing Sets


Signal Selection Basic Idea

• Choose a signal w/ low intra-class distance

• Choose a signal w/ high inter-class distance
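One simple way to turn these two criteria into a single number is the gap between class means divided by the within-class spread. This is an illustrative stand-in, not the χ2 measure this deck goes on to use; the function name and the expression values are hypothetical.

```python
from statistics import mean, pstdev

def separation_score(class_a, class_b):
    """|difference of class means| / (sum of within-class standard deviations)."""
    gap = abs(mean(class_a) - mean(class_b))        # inter-class distance
    spread = pstdev(class_a) + pstdev(class_b)      # intra-class distance
    if spread == 0:
        return float("inf") if gap > 0 else 0.0
    return gap / spread

# Hypothetical expression values of two genes across two classes of samples.
good = separation_score([10, 11, 9], [20, 21, 19])   # tight classes, far apart
poor = separation_score([10, 20, 15], [12, 18, 16])  # wide classes, overlapping
```

Genes would then be ranked by score and the top few kept as signals.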


Signal Selection by χ2


Emerging Patterns

• An emerging pattern is a set of conditions
– usually involving several features
– that most members of a class satisfy
– but none or few of the other class satisfy

• A jumping emerging pattern is an emerging pattern that
– some members of a class satisfy
– but no members of the other class satisfy

• We use only jumping emerging patterns
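The definition of a jumping emerging pattern can be checked mechanically: count how many samples of each class satisfy every condition in the pattern. The sketch below borrows the two gene conditions from the Examples slide of this deck, but the sample expression values and the function names are hypothetical.

```python
def frequency(pattern, samples):
    """Number of samples that satisfy every condition in the pattern."""
    return sum(all(cond(s) for cond in pattern) for s in samples)

def is_jumping_ep(pattern, home_class, other_class):
    """True if some home-class samples match the pattern and no other-class sample does."""
    return frequency(pattern, home_class) > 0 and frequency(pattern, other_class) == 0

# Conditions from the Examples slide; the expression values below are made up.
cond_9 = lambda s: s["37720_at"] > 215
cond_36 = lambda s: s["38028_at"] <= 12

positives = [
    {"37720_at": 300, "38028_at": 5},
    {"37720_at": 250, "38028_at": 10},
    {"37720_at": 100, "38028_at": 8},
]
negatives = [
    {"37720_at": 300, "38028_at": 40},
    {"37720_at": 90, "38028_at": 3},
]

pair_is_jep = is_jumping_ep([cond_9, cond_36], positives, negatives)
single_is_jep = is_jumping_ep([cond_9], positives, negatives)
```

In this toy data the pair {9, 36} is a jumping emerging pattern while condition 9 alone is not, which shows why patterns usually involve several features.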


Examples

Reference number 9: the expression of gene 37720_at > 215
Reference number 36: the expression of gene 38028_at <= 12

Patterns    Frequency (P)   Frequency (N)
{9, 36}     38 instances    0
{9, 23}     38              0
{4, 9}      38              0
{9, 14}     38              0
{6, 9}      38              0
{7, 21}     0               36
{7, 11}     0               35
{7, 43}     0               35
{7, 39}     0               34
{24, 29}    0               34

Easy interpretation


PCL: Prediction by Collective Likelihood


PCL Learning

Top-ranked EPs in positive class: $EP_1^P$ (90%), $EP_2^P$ (86%), …, $EP_n^P$ (68%)

Top-ranked EPs in negative class: $EP_1^N$ (100%), $EP_2^N$ (95%), …, $EP_n^N$ (80%)

The idea of summarizing multiple top-ranked EPs is intended to avoid some rare tie cases


PCL Testing

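$\mathrm{Score}^P = \frac{EP_1^{P'}}{EP_1^P} + \cdots + \frac{EP_k^{P'}}{EP_k^P}$

where $EP_i^{P'}$ is the i-th most freq EP of pos class in the test sample, and $EP_i^P$ is the i-th most freq EP of pos class

Similarly, $\mathrm{Score}^N = \frac{EP_1^{N'}}{EP_1^N} + \cdots + \frac{EP_k^{N'}}{EP_k^N}$

If $\mathrm{Score}^P > \mathrm{Score}^N$, then positive class; otherwise negative class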
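The scoring rule can be sketched as follows, reading each EP symbol in the ratios as that pattern's frequency in its class. The pattern ids and counts are taken from the Examples slide; the data layout, function names, and test samples are assumptions for illustration, not the authors' implementation.

```python
def pcl_score(class_eps, satisfied, k):
    """Sum over i = 1..k of freq(i-th matching EP) / freq(i-th top-ranked EP).

    class_eps: (pattern, frequency) pairs sorted by decreasing frequency,
               each pattern being a set of condition ids.
    satisfied: set of condition ids that hold in the test sample.
    """
    top = class_eps[:k]
    matched = [(p, f) for p, f in class_eps if p <= satisfied][:k]
    # zip stops early when fewer than k EPs match the sample.
    return sum(f_match / f_top for (_, f_match), (_, f_top) in zip(matched, top))

def pcl_predict(pos_eps, neg_eps, satisfied, k=2):
    score_p = pcl_score(pos_eps, satisfied, k)
    score_n = pcl_score(neg_eps, satisfied, k)
    return ("positive" if score_p > score_n else "negative"), score_p, score_n

pos_eps = [({9, 36}, 38), ({9, 23}, 38), ({4, 9}, 38)]
neg_eps = [({7, 21}, 36), ({7, 11}, 35), ({7, 43}, 35)]

# Hypothetical test samples, described by the condition ids they satisfy.
label_a, sp_a, sn_a = pcl_predict(pos_eps, neg_eps, {4, 9, 36})
label_b, sp_b, sn_b = pcl_predict(pos_eps, neg_eps, {7, 43})
```

Summing over k patterns rather than using only the single best one is what lets the score break the rare ties mentioned on the PCL Learning slide.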


Accuracy of PCL (vs. other classifiers)

The classifiers are all applied to the 20 genes selected by χ2 at each level of the tree


Understandability of PCL

• E.g., for T-ALL vs. OTHERS, one ideally discriminatory gene 38319_at was found, inducing these 2 EPs

• These give us the diagnostic rule


Multidimensional Scaling Plot for Subtype Diagnosis

Obtained by performing PCA on the 20 genes chosen for each level


Impact

Conventional Tx:
• intermediate intensity to everyone
⇒ 10% suffer relapse
⇒ 50% suffer side effects
⇒ costs US$150m/yr

Our optimized Tx:
• high intensity to 10%
• intermediate intensity to 40%
• low intensity to 50%
• costs US$100m/yr

• High cure rate of 80%
• Less relapse
• Fewer side effects
• Save US$51.6m/yr


Asian Innovation Gold Award, Far Eastern Economic Review, November 2003


Is there a new subtype?

• Hierarchical clustering of gene expression profiles reveals a novel subtype of childhood ALL


Hierarchical Clustering

• Assign each item to its own cluster
– If there are N items initially, we get N clusters, each containing just one item

• Find the “most similar” pair of clusters, merge them into a single cluster, so we now have one less cluster
– “Similarity” is often defined using single linkage, complete linkage, or average linkage

• Repeat previous step until all items are clustered into a single cluster of size N


Single, Complete, & Average Linkage

Single linkage defines distance betw two clusters as min distance betw them

Complete linkage defines distance betw two clusters as max distance betw them

Exercise: Give definition of “average linkage”

Image source: UCL Microcore Website
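The agglomerative procedure on these two slides can be sketched as a short loop. This is a naive illustration with hypothetical names and toy data; only single and complete linkage are implemented, since average linkage is left as the exercise.

```python
def linkage_distance(c1, c2, dist, method):
    """Distance between two clusters under single (min) or complete (max) linkage."""
    pairwise = [dist(a, b) for a in c1 for b in c2]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    raise ValueError("method must be 'single' or 'complete'")

def agglomerate(items, dist, method="single"):
    """Merge the closest pair of clusters until one cluster remains; return the merges."""
    clusters = [[x] for x in items]  # each item starts in its own cluster
    merges = []
    while len(clusters) > 1:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage_distance(
            clusters[ij[0]], clusters[ij[1]], dist, method))
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return merges

# Hypothetical 1-D data with absolute difference as the distance.
merges = agglomerate([1, 2, 6, 7, 20], lambda a, b: abs(a - b), method="single")
```

The list of merges is the information a dendrogram draws; cutting it early gives a flat clustering.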


Selection of Patient Samples and Genes for Disease Prognosis


Gene Expression Profile + Clinical Data ⇒ Outcome Prediction

• Univariate & multivariate Cox survival analysis (Beer et al 2002, Rosenwald et al 2002)

• Fuzzy neural network (Ando et al 2002)

• Partial least squares regression (Park et al 2002)

• Weighted voting algorithm (Shipp et al 2002)

• Gene index and “reference gene” (LeBlanc et al 2003)

• ……


Another Approach …

• “Extreme” sample selection

• ERCOF


Extreme Sample Selection

Short-term Survivors vs. Long-term Survivors

• T: sample
• F(T): follow-up time
• E(T): status (1: unfavorable; 0: favorable)
• c1 and c2: thresholds of survival time

• Short-term survivors, who died within a short period: F(T) < c1 and E(T) = 1

• Long-term survivors, who were alive after a long follow-up time: F(T) > c2
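The two selection rules translate directly into filters on follow-up time F(T) and status E(T). In the sketch below the record layout, field names, and patient records are hypothetical; the thresholds c1 = 1 and c2 = 8 years follow the "< 1 yr vs > 8 yrs" setting quoted later in the deck.

```python
def select_extremes(samples, c1, c2):
    """Split samples into short-term and long-term survivors; the rest are dropped."""
    short_term = [s for s in samples if s["follow_up"] < c1 and s["status"] == 1]
    long_term = [s for s in samples if s["follow_up"] > c2]
    return short_term, long_term

# Hypothetical records: follow-up time in years, status 1 = unfavorable, 0 = favorable.
samples = [
    {"id": "a", "follow_up": 0.5, "status": 1},  # died early: short-term
    {"id": "b", "follow_up": 0.8, "status": 0},  # censored early: not selected
    {"id": "c", "follow_up": 9.0, "status": 0},  # alive after long follow-up: long-term
    {"id": "d", "follow_up": 4.0, "status": 1},  # in between: not selected
    {"id": "e", "follow_up": 8.5, "status": 1},  # followed more than c2 years: long-term
]
short_term, long_term = select_extremes(samples, c1=1.0, c2=8.0)
```

Note that the long-term rule does not test E(T), so a patient who died only after more than c2 years still counts as a long-term survivor.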


ERCOF: Entropy-Based Rank Sum Test & Correlation Filtering

• Remove genes with expression values w/o cut point found (can’t be discretized)

• Calculate Wilcoxon rank sum w(x) for gene x. Remove gene x if w(x) ∈ [c_lower, c_upper]

• Group features by Pearson correlation. For each group, retain the top 50% wrt class entropy
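The rank sum filter in the second step can be illustrated with a small pure-Python sketch. Several details here are assumptions rather than ERCOF's actual choices: which class's ranks are summed, the use of average ranks for ties, and the threshold values, which are placeholders rather than critical values from a table.

```python
def rank_sum(class_a, class_b):
    """Wilcoxon rank sum of class_a: sum of its (tie-averaged) ranks in the pooled data."""
    pooled = sorted(class_a + class_b)

    def avg_rank(v):
        first = pooled.index(v) + 1   # 1-based rank of the first occurrence
        count = pooled.count(v)       # tied values share the average rank
        return first + (count - 1) / 2

    return sum(avg_rank(v) for v in class_a)

def keep_gene(w, c_lower, c_upper):
    """A gene is removed when its rank sum falls inside [c_lower, c_upper]."""
    return not (c_lower <= w <= c_upper)

# Hypothetical expression values of two genes in two classes of three samples each.
w_separated = rank_sum([1, 2, 3], [4, 5, 6])
w_mixed = rank_sum([1, 4, 5], [2, 3, 6])
```

A rank sum near the middle of its range means the two classes are interleaved, which is why genes inside the interval are discarded.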


Risk Score Construction

Linear kernel SVM regression function:

$G(T) = \sum_i a_i y_i K(T, x(i)) + b$

T: test sample, x(i): support vector, $y_i$: class label (1: short-term survivors; -1: long-term survivors)

Transformation function (posterior probability):

$S(T) = \frac{1}{1 + e^{-G(T)}}, \quad S(T) \in (0, 1)$

S(T): risk score of sample T
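A minimal sketch of the risk score follows: a linear-kernel decision value G(T) passed through the logistic transformation to give S(T). The support vectors, coefficients, and test sample are invented for illustration; the 0.3 and 0.7 cut-offs are the thresholds quoted later in the deck.

```python
import math

def decision_value(test, support_vectors, b):
    """G(T) = sum_i a_i * y_i * K(T, x_i) + b with a linear kernel K(u, v) = u . v."""
    def dot(u, v):
        return sum(p * q for p, q in zip(u, v))
    return sum(a * y * dot(test, x) for a, y, x in support_vectors) + b

def risk_score(g):
    """S(T) = 1 / (1 + exp(-G(T))), a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-g))

def risk_group(s, low=0.3, high=0.7):
    if s > high:
        return "short-term"
    if s < low:
        return "long-term"
    return "undetermined"

# Hypothetical (a_i, y_i, x_i) triples; y = 1 for short-term, -1 for long-term.
svs = [(0.5, 1, [2.0, 0.0]), (0.5, -1, [0.0, 2.0])]
g = decision_value([3.0, 1.0], svs, b=0.0)
s = risk_score(g)
group = risk_group(s)
```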


Diffuse Large B-Cell Lymphoma

• DLBC lymphoma is the most common type of lymphoma in adults

• Can be cured by anthracycline-based chemotherapy in 35 to 40 percent of patients

⇒ DLBC lymphoma comprises several diseases that differ in responsiveness to chemotherapy

• Intl Prognostic Index (IPI)
– age, “Eastern Cooperative Oncology Group” performance status, tumor stage, lactate dehydrogenase level, sites of extranodal disease, ...

• Not very good for stratifying DLBC lymphoma patients for therapeutic trials

⇒ Use gene-expression profiles to predict outcome of chemotherapy?


Rosenwald et al., NEJM 2002

• 240 data samples
– 160 in preliminary group
– 80 in validation group
– each sample described by 7399 microarray features

• Rosenwald et al.’s approach
– identify gene: Cox proportional-hazards model
– cluster identified genes into four gene signatures
– calculate for each sample an outcome-predictor score
– divide patients into quartiles according to score


Knowledge Discovery from Gene Expression of “Extreme” Samples

[Figure: workflow]

• 240 samples; 7399 genes

• “Extreme” sample selection (< 1 yr vs > 8 yrs): 47 short-term survivors, 26 long-term survivors

• Knowledge discovery from gene expression: 84 genes

• 80 samples

• T is long-term if S(T) < 0.3; T is short-term if S(T) > 0.7


Discussions: Sample Selection

Application   Data set      Status: Dead   Status: Alive   Total
DLBCL         Original      88             72              160
DLBCL         Informative   47+1(*)        25              73

Number of samples in original data and selected informative training set.
(*): Number of samples whose corresponding patient was dead at the end of follow-up time, but selected as a long-term survivor.


Discussions: Gene Identification

Gene selection   DLBCL
Original         4937(*)
Phase I          132 (2.7%)
Phase II         84 (1.7%)

Number of genes left after feature filtering for each phase.
(*): Number of genes after removing those genes that were absent in more than 10% of the experiments.


Kaplan-Meier Plot for 80 Test Cases

• p-value of log-rank test: < 0.0001
• Risk score thresholds: 0.7, 0.3
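For readers who have not met the Kaplan-Meier estimator, here is a minimal sketch of how such a survival curve is computed from follow-up times and death/censoring flags. The function name and the toy data in the usage note are illustrative assumptions. The log-rank test quoted on the slide is a separate procedure and is not shown.

```python
def kaplan_meier(times, events):
    """Return [(t, survival)] at each distinct time where a death is observed.

    times  : follow-up time of each patient
    events : True if death was observed at that time, False if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            leaving += 1
            i += 1
        if deaths:
            # Product-limit step: multiply by the fraction surviving time t.
            surv *= 1 - deaths / n_at_risk
            curve.append((t, surv))
        # Deaths and censored patients both leave the risk set.
        n_at_risk -= leaving
    return curve
```

With `times = [1, 2, 2, 3, 4]` and `events = [True, True, False, True, False]`, the curve drops to about 0.8, 0.6, and 0.3 at times 1, 2, and 3. Censored patients shrink the risk set without causing a drop.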


Improvement Over IPI

(A) IPI low, p-value = 0.0063
(B) IPI intermediate, p-value = 0.0003


Merit of “Extreme” Samples

(A) W/o sample selection (p = 0.38)
(B) With sample selection (p = 0.009)

If no training sample selection is conducted, there is no clear difference in the overall survival of the 80 samples in the validation group of the DLBCL study


Is ERCOF Useful? Observations from 1000+ Expts

• Feature selection methods considered
– All: use all features
– All-entropy: select features whose value range can be partitioned by Fayyad & Irani’s entropy method
– Mean-entropy: select features whose entropy is better than the mean entropy
– Top-number-entropy: select the top 20, 50, 100, 200 genes by their entropy
– ERCOF: at 5% significance level for Wilcoxon rank sum test and 0.99 Pearson correlation coeff threshold

• Data sets considered
– Colon tumor
– Prostate cancer
– Lung cancer
– Ovarian cancer
– DLBC lymphoma
– ALL-AML
– Childhood ALL

• Learning methods considered
– C4.5
– Bagging, Boosting, CS4
– SVM, 3-NN
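The ERCOF setting above mentions a 0.99 Pearson correlation threshold. The sketch below shows one simple way such a correlation filter can work, assuming genes arrive already ranked (best first) and a gene is dropped when its correlation with an already kept gene exceeds the threshold. The function names, the greedy strategy, and the toy data are illustrative assumptions, not the published ERCOF procedure. The entropy and Wilcoxon rank sum stages are not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant profile carries no correlation information
    return sxy / sqrt(sxx * syy)


def correlation_filter(ranked_genes, expr, threshold=0.99):
    """Greedily keep genes that are not near-duplicates of a kept gene.

    ranked_genes : gene names, best first (ranking is assumed to be given)
    expr         : dict mapping gene name to its expression values
    """
    kept = []
    for gene in ranked_genes:
        if all(pearson(expr[gene], expr[k]) <= threshold for k in kept):
            kept.append(gene)
    return kept
```

For example, with `expr = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 1, 3, 2]}`, the gene `g2` is a scaled copy of `g1` and is filtered out, while `g3` is kept.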


ERCOF vs All-Entropy

All-entropy wins 4 times

ERCOF wins 60 times


ERCOF vs Mean-Entropy

Mean-entropy wins 18 times

ERCOF wins 42 times


Effectiveness of ERCOF

22Total wins 38 46 47 48 51 47 67


Conclusions

• Selecting extreme cases as training samples is an effective way to improve patient outcome prediction based on gene expression profiles and clinical information

• ERCOF is very suitable for SVM, 3-NN, CS4, Random Forest, as it gives these learning algos the highest no. of wins

• ERCOF is suitable for Bagging also, as it gives this classifier the lowest no. of errors

⇒ ERCOF is a systematic feature selection method that is very useful


Beyond Classification of Gene Expression Profiles

• After identifying the candidate genes by feature selection, do we know which ones are causal genes and which ones are surrogates?

[Figure: heat map of diagnostic ALL BM samples (n=327) against genes for class distinction (n=271). Colour scale from -3σ to 3σ, where σ = std deviation from mean. Subtype labels: TEL-AML1, BCR-ABL, Hyperdiploid >50, E2A-PBX1, MLL, T-ALL, Novel.]


Gene Regulatory Circuits

• Genes are “connected” in a “circuit” or network

• Expr of a gene in a network depends on expr of some other genes in the network

• Can we “reconstruct” the gene network from gene expression and other data?

Source: Miltenyi Biotec


Recommended Readings

• Wong, The Practical Bioinformatician, 2004, ICP. Chapter 14

• Li & Wong, Identifying good diagnostic genes or gene groups from gene expression data by using the concept of emerging patterns. Bioinformatics, 18:725-734, 2002

• Miller et al., Optimal gene expression analysis by microarrays. Cancer Cell, 2:353-361, 2002

• Liu et al., Use of Extreme Patient Samples for Outcome Prediction from Gene Expression Data. Bioinformatics, 21:3377-3384, 2005

• Eisen et al., Cluster analysis and display of genome-wide expression patterns. PNAS, 95:14863-14868, 1998

• Tibshirani et al., Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS, 99:6567-6572, 2002


Biomedical Text Mining

50 min


Text Mining of Biomedical Literature

• Knowledge discovery system for processing free texts in scientific abstracts

• Specialized in
– Extracting protein names
– Extracting protein-protein interactions
– Etc …


Main Steps of Text Mining


Bio Entity Name Recognition, Zhou et al., BioCreAtIvE, 2004

Adapted w/ permission from Jian Su


Bio Entity Name Recognition: Ensemble Classification Approach

• Features considered
– orthographic, POS, morphologic, surface word, trigger words (TW1: receptor, enhancer, etc. TW2: activation, stimulation, etc.)

• SVM
– Context of 7 words
– Each word gives 5 features, plus its position

• HMMs
– 3 features used (orthographic, POS, surface word)
– HMM1 & HMM2 use POS taggers trained on diff corpora

[Figure: ensemble diagram]
• HMM1: balanced precision & recall
• HMM2: low precision & high recall
• SVM: high precision & low recall
⇒ Majority Voting
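The majority-voting step that combines the three taggers can be sketched as below. This is a generic token-level vote, assuming each tagger emits one label per token. The B/I/O labels and the tie-breaking behaviour are illustrative assumptions, since the slide does not specify them.

```python
from collections import Counter


def majority_vote(*label_seqs):
    """Combine per-token label sequences from several taggers by majority vote.

    If every tagger disagrees, the label of the earliest tagger wins,
    because Counter.most_common keeps first-seen order among ties.
    """
    voted = []
    for labels in zip(*label_seqs):
        label, _count = Counter(labels).most_common(1)[0]
        voted.append(label)
    return voted
```

For instance, `majority_vote(["B", "I", "O", "O"], ["B", "I", "I", "O"], ["O", "I", "O", "O"])` returns `["B", "I", "O", "O"]`: each disagreement is settled by the two taggers that agree.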


Why Ensemble?

• h1, h2, h3 are indep classifiers w/ accuracy = 60%
• C1, C2 are the only classes
• t is a test instance in C1
• h(t) = argmax over C ∈ {C1, C2} of |{hj ∈ {h1, h2, h3} | hj(t) = C}|
• Then prob(h(t) = C1)
  = prob(h1(t)=C1 & h2(t)=C1 & h3(t)=C1)
  + prob(h1(t)=C1 & h2(t)=C1 & h3(t)=C2)
  + prob(h1(t)=C1 & h2(t)=C2 & h3(t)=C1)
  + prob(h1(t)=C2 & h2(t)=C1 & h3(t)=C1)
  = 60% * 60% * 60% + 60% * 60% * 40% + 60% * 40% * 60% + 40% * 60% * 60%
  = 64.8%
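The 64.8% figure is a binomial tail: the vote is right whenever at least two of the three independent classifiers are right. A small sketch that reproduces the arithmetic and generalizes it to any odd number of classifiers follows. The function name is an illustrative assumption, and it relies on the same independence and two-class assumptions as the slide.

```python
from math import comb


def majority_accuracy(n, p):
    """Probability that a majority of n independent classifiers is correct.

    n : number of classifiers (odd, so ties cannot occur with two classes)
    p : accuracy of each individual classifier
    """
    need = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))
```

`majority_accuracy(3, 0.6)` gives 0.648, matching the slide, and `majority_accuracy(5, 0.6)` gives about 0.683. The gain only appears when each classifier is better than chance and their errors are independent.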


Bio Entity Name Recognition:

Performance at BioCreAtIvE 2004

Adapted w/ permission from Zhou et al., BioCreAtIvE 2004


Co-Reference Resolution, Yang et al., IJCNLP, 2004

Adapted w/ permission from Yang et al., IJCNLP 2004


Co-Reference Resolution:

Baseline Features Used

Adapted w/ permission from Yang et al., IJCNLP 2004


Co-Reference Resolution:

New Features Used & Performance

Adapted w/ permission from Yang et al., IJCNLP 2004

Base Classifier:

C5.0


Protein Interaction Extraction, Xiao et al., IJCNLP 2004

• The max entropy model:

• where
– o is the outcome
– h is the feature vector
– Z(h) is normalization function
– fj are feature functions
– αj are feature weights

Adapted w/ permission from Xiao et al., IJCNLP 2004
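The formula on this slide did not survive as text. A maximum entropy model in the multiplicative form that matches the symbols listed (outcome o, feature vector h, normalization function Z(h), feature functions fj, feature weights αj) is usually written as below. This is the standard textbook form, assumed here rather than copied from Xiao et al.

```latex
P(o \mid h) = \frac{1}{Z(h)} \prod_{j} \alpha_j^{\,f_j(h,\,o)},
\qquad
Z(h) = \sum_{o'} \prod_{j} \alpha_j^{\,f_j(h,\,o')}
```

Z(h) sums the unnormalized scores over all possible outcomes o', so that the probabilities for a given h add up to 1.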


Protein Interaction Extraction:

Features Used

Adapted w/ permission from Xiao et al.


Protein Interaction Extraction:

Performance on IEPA Corpus

Adapted w/ permission from Xiao et al.

IEPA = Interaction Extraction Performance Assessment


Issues in Constructing Gene/Protein Networks

• Sound:
– Are the contents of our databases correct?

• Complete:
– Is the structure of our databases expressive enough to capture critical information explicitly?

• Understandable:
– Are our databases or search results understandable?


Soundness: Is the content of our databases correct?

Adapted w/ permission from Judice Koh, Mong Li Lee, & Vladimir Brusic

• 11 types & 28 subtypes of data artifacts
– Critical artifacts (vector contaminated sequences, duplicates, sequence structure violations)
– Non-critical artifacts (misspellings, synonyms)

• > 20,000 seq records in public databases contain artifacts

• Identification of these artifacts is impt for accurate knowledge discovery

• Sources of artifacts
– Diverse sources of data
  • Repeated submissions of seqs to db’s
  • Cross-updating of db’s
– Data annotation
  • Db’s have diff ways for data annotation
  • Data entry errors can be introduced
  • Different interpretations
– Lack of standardized nomenclature
  • Variations in naming
  • Synonyms, homonyms, & abbrevn
– Inadequacy of data quality control mechanisms
  • Systematic approaches to data cleaning are lacking


Classification of Errors, Koh et al., DBiDB, 2005

Adapted w/ permission from Judice Koh, Mong Li Lee, & Vladimir Brusic


Spelling Errors

Adapted w/ permission from Judice Koh, Mong Li Lee, & Vladimir Brusic

• Usually typo errors

• Occur in different fields of the record

• We identified 569 possible misspelled words affecting up to 20,505 nucleotide records in Entrez.

Examples (misspelling ⇒ correction, followed by the context of the misspelling):

• asociated ⇒ associated
  EMBL:Y18050 E.faecium pbp5 gene
  TITLE Modification of penicillin-binding protein 5 asociated with high level ampicillin resistance in Enterococcus faecium
  gi|1143442|emb|X92687.1|EFPBP5G

• tranmembrane ⇒ transmembrane
  Swiss-Prot:P03385 Env polyprotein precursor
  DEFINITION Env polyprotein precursor [Contains: Surface protein (SU) (GP70); Tranmembrane protein (TM) (p15E); R protein].
  gi|119478|sp|P03385|ENV_MLVMO

• Cassete ⇒ Cassette
  Patent Database:A76783 Sequence 11 from Patent WO9315210
  CDS <1..150 /note="gene cassete encoding intercalating jun-zipper and linker"
  gi|6088638|emb|A76783.1||pat|WO|9315210|11[6088638]

• Immunoglobin ⇒ Immunoglobulin
  GenBank:AAD26534 nectin-1 [Rattus norvegicus]
  TITLE Nectin/PRR: An Immunogloblin-like Cell Adhesion Molecule Recruited to Cadherin-based Adherens Junctions through Interaction with Afadin, a PDZ Domain-containing Protein
  gi|4590334|gb|AAD26534.1

[Figure: classification of data artifacts by level (attribute, record, single source database, multiple source database). Classes: invalid values, ambiguity, incompatible schema, annotation error, dubious sequences, sequence redundancy, data provenance flaws, cross-annotation error, sequence structure violation, vector contaminated sequence, erroneous data transformation. Subtypes shown: spelling errors, format violation.]
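Misspellings like these can be flagged with a dictionary lookup plus fuzzy matching. The sketch below uses Python's standard `difflib`. The function name, the vocabulary, and the 0.85 similarity cutoff are illustrative assumptions, and this is not the procedure Koh et al. used to find the 569 words.

```python
import difflib
import re


def suggest_corrections(text, vocabulary, cutoff=0.85):
    """Return {misspelled word: suggested correction} for words close to,
    but not in, a reference vocabulary."""
    vocab = sorted({w.lower() for w in vocabulary})
    known = set(vocab)
    suggestions = {}
    for word in re.findall(r"[A-Za-z]+", text):
        lower = word.lower()
        if lower in known:
            continue
        # Closest vocabulary entry above the similarity cutoff, if any.
        match = difflib.get_close_matches(lower, vocab, n=1, cutoff=cutoff)
        if match:
            suggestions[word] = match[0]
    return suggestions
```

With a vocabulary containing "associated" and "cassette", the text "protein 5 asociated with gene cassete" yields suggestions for "asociated" and "cassete". In practice the vocabulary would need to include domain terms, otherwise valid gene and protein names are flagged.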


Meaningless Seqs

Adapted w/ permission from Judice Koh, Mong Li Lee, & Vladimir Brusic

• Among the 5,146,255 protein records queried using Entrez to the major protein or translated nucleotide databases, 3,327 protein sequences are shorter than four residues (as of Sep 2004).

• In Nov 2004, the total number of undersized protein sequences increased to 3,350.

• Among 43,026,887 nucleotide records queried using Entrez to major nucleotide databases, 1,448 records contain sequences shorter than six bases (as of Sep 2004).

• In Nov 2004, the total number of undersized nucleotide sequences increased to 1,711.

[Figure: Undersized protein sequences in major databases. Bar chart of number of records (axis 0 to 1200) by sequence length (1 to 3) for DDBJ, EMBL, GenBank, PDB, SwissProt, PIR.]

[Figure: Undersized nucleotide sequences in major databases. Bar chart of number of records (axis 0 to 250) by sequence length (1 to 5) for DDBJ, EMBL, GenBank, PDB.]

[Figure: classification of data artifacts by level (attribute, record, single source database, multiple source database). Classes: invalid values, ambiguity, incompatible schema, annotation error, dubious sequences, sequence redundancy, data provenance flaws, cross-annotation error, sequence structure violation, vector contaminated sequence, erroneous data transformation. Subtypes shown: uninformative sequences, undersized sequences.]


Overlapping Intron/Exons

Adapted w/ permission from Judice Koh, Mong Li Lee, & Vladimir Brusic

• Syn7 gene of putative polyketide synthase in NCBI TPA record BN000507 has overlapping intron 5 and exon 6.

• rpb7+ RNA polymerase II subunit in GENBANK record AF055916 has overlapping exon 1 and exon 2.

[Figure: classification of data artifacts by level (attribute, record, single source database, multiple source database). Classes: invalid values, ambiguity, incompatible schema, annotation error, dubious sequences, sequence redundancy, data provenance flaws, cross-annotation error, sequence structure violation, vector contaminated sequence, erroneous data transformation. Subtype shown: overlapping intron/exon.]
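A sequence-structure check of this kind reduces to interval overlap detection. The following is a minimal sketch, assuming each annotated feature is given as a name with 1-based inclusive start and end coordinates. The function name and the coordinates in the usage note are made up for illustration and are not taken from BN000507 or AF055916.

```python
def find_overlaps(features):
    """Return pairs of feature names whose coordinate ranges overlap.

    features : list of (name, start, end), coordinates 1-based and inclusive
    """
    ordered = sorted(features, key=lambda f: (f[1], f[2]))
    overlaps = []
    for i, (name_a, _start_a, end_a) in enumerate(ordered):
        for name_b, start_b, _end_b in ordered[i + 1:]:
            if start_b > end_a:
                break  # sorted by start, so no later feature can overlap a
            overlaps.append((name_a, name_b))
    return overlaps
```

For example, `find_overlaps([("exon 5", 100, 200), ("intron 5", 201, 300), ("exon 6", 290, 400)])` returns `[("intron 5", "exon 6")]`, because exon 6 starts before intron 5 ends.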


Seqs w/ Identical Info

Adapted w/ permission from Judice Koh, Mong Li Lee, & Vladimir Brusic

• Submission of the same sequence to different databases

• Repeated submission of the same sequence to the same database

• Initially submitted by different groups

• Protein sequences may be translated from duplicate nucleotide sequences

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11692005&dopt=GenPept

[Figure: classification of data artifacts by level (attribute, record, single source database, multiple source database). Classes: invalid values, ambiguity, incompatible schema, annotation error, dubious sequences, sequence redundancy, data provenance flaws, cross-annotation error, sequence structure violation, vector contaminated sequence, erroneous data transformation. Subtypes shown: replication of sequence information, different views, overlapping annotations of the same sequence.]


Trying our hands on data cleansing problems

[Figure: data artifacts mapped to candidate cleaning methods]
• Levels: attribute, record, single-source database, multi-source database
• Artifacts: concatenated values, spelling errors, format violation, synonyms, homonyms/abbreviations, misuse of fields, mis-fielded values, features do not correspond with sequence, putative features, undersized sequences, uninformative sequences, replication of sequence information, different views, overlapping annotations of the same sequence, fragments, sequence structure violation, vector contaminated sequences, cross-annotation error
• Methods: schema remapping, sequence structure parser, dictionary lookup, integrity constraints, duplicate detection, comparative analysis, vector screening


Association Rule Mining for De-duplication

• Compute similarity scores from known duplicate pairs
• Select matching criteria
• Generate association rules
• Detect duplicates using the rules


Dataset

• Source: Entrez (GenBank, GenPept, SwissProt, DDBJ, PIR, PDB)

• Scorpion venom dataset containing 520 records
– Query: scorpion AND (venom OR toxin)
– 251 duplicate pairs

• Snake PLA2 venom dataset containing 780 records
– Query: serpentes AND venom AND PLA2
– 444 duplicate pairs

• Expert annotation: 695 duplicate pairs are collectively identified.


Features to Match

Adapted w/ permission from Judice Koh, Mong Li Lee, & Vladimir Brusic


Association Rule Mining

Similarity scores of known duplicate pairs:

AAG39642 AAG39643 AC0.9 LE1.0 DE1.0 DB1 SP1 RF1.0 PD0 FT1.0 SQ1.0
AAG39642 Q9GNG8 AC0.1 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT0.1 SQ1.0
P00599 PSNJ1W AC0.2 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0
P01486 NTSREB AC0.0 LE1.0 DE0.3 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0
O57385 CAA11159 AC0.1 LE1.0 DE0.5 DB0 SP1 RF0.0 PD0 FT0.1 SQ1.0
S32792 P24663 AC0.0 LE1.0 DE0.4 DB0 SP1 RF0.5 PD0 FT1.0 SQ1.0
P45629 S53330 AC0.0 LE1.0 DE0.2 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0

⇒ Association rule mining

Frequent item-sets with support:

LE1.0 PD0 SQ1.0 (99.7%)
SP1 PD0 SQ1.0 (97.1%)
SP1 LE1.0 PD0 SQ1.0 (96.8%)
DB0 PD0 SQ1.0 (93.1%)
DB0 LE1.0 PD0 SQ1.0 (92.8%)
DB0 SP1 PD0 SQ1.0 (90.4%)
DB0 SP1 LE1.0 PD0 SQ1.0 (90.1%)
RF1.0 SP1 LE1.0 PD0 SQ1.0 (47.6%)
RF1.0 DB0 LE1.0 PD0 SQ1.0 (44.0%)
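The "support" of an item-set is simply the fraction of known duplicate pairs whose similarity profile contains every item in the set. The sketch below computes it for the seven example pairs shown on this slide. The function names and the brute-force enumeration are illustrative assumptions, not the mining algorithm used in the study. Because only seven pairs are used, the numbers differ from the slide's percentages, which come from the full set of annotated pairs.

```python
from itertools import combinations

# The seven similarity profiles listed on the slide, one per duplicate pair.
ROWS = [
    "AC0.9 LE1.0 DE1.0 DB1 SP1 RF1.0 PD0 FT1.0 SQ1.0",
    "AC0.1 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT0.1 SQ1.0",
    "AC0.2 LE1.0 DE0.4 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0",
    "AC0.0 LE1.0 DE0.3 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0",
    "AC0.1 LE1.0 DE0.5 DB0 SP1 RF0.0 PD0 FT0.1 SQ1.0",
    "AC0.0 LE1.0 DE0.4 DB0 SP1 RF0.5 PD0 FT1.0 SQ1.0",
    "AC0.0 LE1.0 DE0.2 DB0 SP1 RF1.0 PD0 FT1.0 SQ1.0",
]
TRANSACTIONS = [set(row.split()) for row in ROWS]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the item-set."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of all item-sets meeting min_support."""
    all_items = sorted({item for t in transactions for item in t})
    # An item-set can only be frequent if each of its items is frequent.
    candidates = [i for i in all_items if support({i}, transactions) >= min_support]
    result = {}
    for size in range(1, len(candidates) + 1):
        for combo in combinations(candidates, size):
            s = support(combo, transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
    return result
```

On these seven rows, `support({"LE1.0", "PD0", "SQ1.0"}, TRANSACTIONS)` is 1.0 and `support({"DB0", "PD0", "SQ1.0"}, TRANSACTIONS)` is 6/7, which mirrors the ordering of the item-sets on the slide.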


Results

Adapted w/ permission from Judice Koh, Mong Li Lee, & Vladimir Brusic

Rule 1: S(Seq)=1 ^ N(Seq Length)=1 ^ M(PDB)=0 (99.7%)
Rule 2: S(Seq)=1 ^ M(PDB)=0 ^ M(Species)=1 (97.1%)
Rule 3: S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 (96.8%)
Rule 4: S(Seq)=1 ^ M(PDB)=0 ^ M(DB)=0 (93.1%)
Rule 5: S(Seq)=1 ^ M(Seq Length)=1 ^ M(PDB)=0 ^ M(DB)=0 (92.8%)
Rule 6: S(Seq)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.4%)
Rule 7: S(Seq)=1 ^ N(Seq Length)=1 ^ M(Species)=1 ^ M(PDB)=0 ^ M(DB)=0 (90.1%)

[Figure: Duplicates detected by association rules. Bar chart for Rules 1 to 7, y-axis "FP% and FN%" from 0 to 60, legend: FP%, FN% x 1000.]

Rule 1. Identical sequences with the same sequence length and not originating from PDB are 99.7% likely to be duplicates.

Rule 2. Identical sequences of the same species and not originating from PDB are 97.1% likely to be duplicates.

Rule 3. Identical sequences with the same sequence length, of the same species and not originating from PDB are 96.8% likely to be duplicates.
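Once rules like these are fixed, applying them to a candidate pair is a simple conjunction test. The sketch below assumes each pair is described by the two-letter similarity codes used on the association rule mining slide, with SQ read as sequence, LE as sequence length, SP as species, PD as PDB, and DB as database. That mapping is inferred from the matching support percentages and is an assumption, as are the function names.

```python
def parse_scores(profile):
    """Turn 'AC0.9 LE1.0 DB1 ...' into {'AC': 0.9, 'LE': 1.0, 'DB': 1.0, ...}."""
    return {item[:2]: float(item[2:]) for item in profile.split()}


def matches_rule(scores, rule):
    """True if every condition of the rule holds for this pair's scores."""
    return all(scores.get(code) == value for code, value in rule.items())


# Rule 1: identical sequence, same length, not from PDB.
RULE_1 = {"SQ": 1.0, "LE": 1.0, "PD": 0.0}
# Rule 7: Rule 1 plus same species and M(DB)=0.
RULE_7 = {"SQ": 1.0, "LE": 1.0, "SP": 1.0, "PD": 0.0, "DB": 0.0}
```

For the pair AAG39642/AAG39643, whose profile has DB1, `matches_rule` is true for `RULE_1` but false for `RULE_7`. The stricter rule trades missed duplicates for fewer false alarms.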


Completeness: Is the structure of our db expressive enough to capture critical info explicitly?

• Take a key paper such as the Kohn paper that summarises current knowledge on p53 regulation

• Is there a structured database that is able to capture all info in that paper explicitly?

• Is there a semi-structured database that is able to capture all info in that paper explicitly?

• How well does this (semi-) structured database generalize to other papers of a similar type?


Understandability: Are our db or search results understandable?

• Take a search on p53. You will get >300k hits or some number like that on MEDLINE

• It is not feasible for anyone to go thru all of that to find what they want! And this problem is growing bigger as MEDLINE doubles every 1-2 years.

• Need to organize the database and/or the search results into a hierarchy or “semantic” net to make it easier for users to understand or to browse the results

• How do we define this hierarchy/net?

• Can this hierarchy/net be self-organized?


Important Unsolved Challenges

• Bio entity name recognition
• Bio interaction extraction
• Extraction of other relevant info
• Information integration
• And so on …


Bio Name Recognition

• Nomenclature loosely followed
• Frequent use of conjunction and disjunction in bio names with multiple bio-entity names sharing one head noun
• Long descriptive names
• Names of genes and proteins used interchangeably


Bio-Interaction Extraction

• Inherent complexity of biological interactions

⇒ Sentences describing them also tend to be complicated


Bio-Interaction Extraction

• Domain knowledge is often needed for interaction template filling


Extraction of Other Relevant Info

• Contextual information
– Species, cell type, cellular localisation, etc

• Negative information

• Speculative & incomplete facts


Information Integration

• Bio-name mapping

• Bio-interaction mapping
– how do you know two complex sentences are talking about the same interaction?


Recommended Readings

• Hirschman et al., Accomplishments and challenges in literature data mining for biology. Bioinformatics, 18:1553-1561, 2002

• Ng & Tan, Challenges in biological literature mining for online discovery of molecular interaction pathways, IJCAT, 2006. (To appear)

• Zhou et al., Recognizing names in biomedical texts: a machine learning approach, Bioinformatics, 20:1178-1190, 2004

• Blaschke et al., Do you do text? Bioinformatics, 21:4199-4200, 2005


Discovering Motif Pairs at Interaction Sites from Protein Sequences on a Proteome-Wide Scale

60 min


Proteins & Their Interactions

• 4 types of reps for proteins: primary, secondary, tertiary, & quaternary

• Protein interactions play an impt role in intercellular communication, in signal transduction, & in the regulation of gene expression

Courtesy of JE Wampler


Binding Sites

• Discovery of binding sites is a key part of understanding mechanisms of protein interactions

• Structure-based approaches
  – E.g., docking
  – Relatively accurate
  – Struct must be known

⇒ Sequence-based approaches


Typical Sequence-Based Approach

• Typical seq-based approaches have two steps:
  – Use pattern discovery algorithms to discover domains and/or motifs of a group of proteins
  – Use domain-domain interaction discovery methods (e.g., domain fusion) to discover interacting domains

• Shortcomings:
  – Protein interaction information is not used by motif discovery algorithms
  – Exact positions of binding sites often not recognized


How about ...

• How about making use of known protein-protein bindings to guide the discovery of binding motifs?


Protein Interaction Graphs

• Yeast SH3 domain-domain interaction network: 394 edges, 206 nodes (Tong et al., Science, v295, 2002)

• 8 proteins containing SH3

• 5 binding at least 6 of them


Bipartite Subgraphs

[Figure: bipartite subgraph with two groups labelled "SH3 Proteins" and "SH3-Binding Proteins"]

The larger this group, the more likely their active sites will show up clearly in a multiple alignment?


Problem Statement

Given a PPI expt E, the problem is

(1) To find all pairs X, Y of interacting protein groups, so that
  (1.1) X and Y have full mutual interactions
  (1.2) X and Y are as large as possible

&

(2) To identify “good” binding motif pairs from these pairs of interacting protein groups


PPI Expt As a Graph

• PPI expt E as undirected graph G_E = ⟨V_E, D_E⟩,
  – where V_E are the proteins and D_E the edges,
  – so that two proteins are connected in G_E iff there is a binding betw them in PPI expt E

• Let L_E(p) denote neighborhood of protein p in G_E

• Let L_E(P) = ∩_{p∈P} L_E(p) denote the common neighborhood of all proteins in P in G_E
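The neighborhood definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy binding list and the names `build_neighborhoods` and `common_neighborhood` are hypothetical and do not come from the talk.

```python
# Hypothetical toy PPI experiment: each tuple is one observed binding.
BINDINGS = [("p1", "p2"), ("p1", "p3"), ("p4", "p2"),
            ("p4", "p3"), ("p5", "p2"), ("p5", "p3")]


def build_neighborhoods(bindings):
    """Return {p: L_E(p)} for the undirected graph G_E given as edge pairs."""
    nbrs = {}
    for a, b in bindings:
        nbrs.setdefault(a, set()).add(b)  # binding is symmetric,
        nbrs.setdefault(b, set()).add(a)  # so record both directions
    return nbrs


def common_neighborhood(nbrs, group):
    """Return L_E(P): proteins that bind every member of a non-empty group P."""
    members = list(group)
    common = set(nbrs[members[0]])
    for p in members[1:]:
        common &= nbrs[p]
    return common


L_E = build_neighborhoods(BINDINGS)
print(sorted(common_neighborhood(L_E, {"p2", "p3"})))
```

A dictionary of sets keeps the intersection in `common_neighborhood` cheap; any adjacency representation would serve equally well.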


Maximality

• Proposition 2.1: Let E be a PPI expt. Let X, Y be a pair of protein groups so that X = L_E(Y) and Y = L_E(X). Let X’, Y’ be another pair of protein groups so that X’ = L_E(Y’), Y’ = L_E(X’), X’ ⊆ X, & Y’ ⊆ Y. Then X = X’ and Y = Y’.

⇒ In other words, if X = L_E(Y) and Y = L_E(X), then X, Y is a maximal pair of protein groups that have full interactions


Recasting to Graph Theory

• X, Y is a pair of interacting protein groups in PPI expt E iff X = L_E(Y) and Y = L_E(X)
  – [Slide annotations on this condition: labels 1.1 and 1.2, the conditions of the problem statement]


Max Complete Bipartite Subgraph

• A graph H = ⟨V_1 ∪ V_2, D_H⟩ is a maximal complete bipartite subgraph of G iff
  – H is a subgraph of G,
  – V_1 × V_2 = D_H,
  – V_1 ∩ V_2 = {}, &
  – There is no H’ = ⟨V’_1 ∪ V’_2, D_H’⟩ with V_1 ⊂ V’_1 & V_2 ⊂ V’_2 that has the same properties above


The Connection to Graph Theory

• X, Y is a pair of interacting protein groups in PPI expt E iff H = ⟨X ∪ Y, X × Y⟩ is a max complete bipartite subgraph of G_E

• Let H = ⟨X ∪ Y, D_E|_{X∪Y}⟩ be subgraph of G_E with X, Y a pair of interacting protein groups

⇒ X = L_E(Y) and Y = L_E(X)

⇒ Full interactions betw X & Y

⇒ X × Y = D_E|_{X∪Y}

• By excluding self-binding, we have X ∩ Y = {}

• By Prop 2.1, we have H is max
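The equivalence above can be checked by brute force on a small graph. The sketch below is an illustration only, on a hypothetical toy graph: it enumerates maximal complete bipartite subgraphs straight from the definition, reading "maximal" as "no other complete pair contains it on both sides", and then tests whether each one satisfies X = L_E(Y) and Y = L_E(X). It is exponential in the number of proteins and is not one of the algorithms cited in the talk.

```python
from itertools import combinations

# Hypothetical toy graph, given directly as neighborhoods L_E(p).
L_E = {
    "p1": {"p2", "p3"}, "p4": {"p2", "p3"}, "p5": {"p2", "p3"},
    "p2": {"p1", "p4", "p5"}, "p3": {"p1", "p4", "p5"},
}


def common(group):
    """L_E(P) for a non-empty group P."""
    return set.intersection(*(L_E[p] for p in group))


def non_empty_subsets(items):
    """Yield every non-empty subset of items as a frozenset."""
    items = sorted(items)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def maximal_complete_bipartite():
    """Brute force over the definition: disjoint, complete, not extendable."""
    subsets = list(non_empty_subsets(L_E))
    complete = [(x, y) for x in subsets for y in subsets
                if not (x & y) and all(q in L_E[p] for p in x for q in y)]
    maximal = set()
    for x, y in complete:
        extendable = any(x <= x2 and y <= y2 and (x, y) != (x2, y2)
                         for x2, y2 in complete)
        if not extendable:
            maximal.add(frozenset({x, y}))  # store the unordered pair {X, Y}
    return maximal


pairs = maximal_complete_bipartite()
# Each maximal pair should satisfy X = L_E(Y) and Y = L_E(X).
agrees = all(common(x) == y and common(y) == x
             for x, y in (tuple(pair) for pair in pairs))
print(len(pairs), agrees)
```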


Therefore … But ...

• Therefore, to find pairs of interacting protein groups, we can use algorithms from graph theory for enumerating maximal complete bipartite subgraphs

• According to Eppstein 1994, this has complexity O(a^3 2^(2a) n), where a is the arboricity of the graph and n the number of vertices

• This is inefficient because a is often around 10-20 in practice (for a = 10, the factor a^3 2^(2a) already exceeds 10^9)


From PPI Expts To Transactions

• In PPI expt E, we obtain for each protein p, a list L_E(p) of proteins that bind p
  – assume p ∉ L_E(p), as such expts are not intended to detect self-binding
  – assume q ∈ L_E(p) implies p ∈ L_E(q), as binding is symmetric

• L_E(p) can be thought of as a transaction & t_E(p) as the “id” of this transaction

⇒ E can be thought of as generating a db of transactions D_E = {t_E(p_1), …, t_E(p_k)}, where p_1, …, p_k are all the proteins involved in E

⇒ a set of proteins X can be thought of as a pattern in D_E if there is t_E(p) ∈ D_E st X ⊆ L_E(p)


Example

• Consider expt E with 5 proteins p1, …, p5, st p2 and p3 bind every protein except themselves

• Then D_E looks like this (as a matrix):

[Figure: matrix of D_E with one row per protein p1, …, p5]
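Since the matrix itself did not survive in this transcript, the following sketch rebuilds a transaction matrix for the example. It rests on an assumption made here, not stated on the slide: "except themselves" is read as p2 and p3 binding p1, p4 and p5 but not each other. The names `transaction_matrix` and `is_pattern` are hypothetical.

```python
# Assumed reading of the example: p2 and p3 each bind p1, p4 and p5,
# and (an assumption of this sketch) do not bind each other.
PROTEINS = ["p1", "p2", "p3", "p4", "p5"]
L_E = {
    "p1": {"p2", "p3"},
    "p2": {"p1", "p4", "p5"},
    "p3": {"p1", "p4", "p5"},
    "p4": {"p2", "p3"},
    "p5": {"p2", "p3"},
}


def transaction_matrix(proteins, nbrs):
    """Row t_E(p) has a 1 in column q exactly when q is in L_E(p)."""
    return [[1 if q in nbrs[p] else 0 for q in proteins] for p in proteins]


def is_pattern(group, nbrs):
    """X is a pattern in D_E if some transaction L_E(p) contains all of X."""
    return any(set(group) <= items for items in nbrs.values())


MATRIX = transaction_matrix(PROTEINS, L_E)
for name, row in zip(PROTEINS, MATRIX):
    print(f"t({name})", *row)
```

Because binding is symmetric, the matrix equals its own transpose; this is what later lets rows (transactions) and columns (items) swap roles.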


Notations

• Let s_E(d) denote the protein p st t_E(p) = d

⇒ s_E(t_E(p)) = p

• Let [|p|]_E denote the set {t_E(q) | p ∈ L_E(q)} of transactions in D_E in which p occurs

⇒ t_E(p) ∈ [|q|]_E implies t_E(q) ∈ [|p|]_E

• Let [|X|]_E denote the set ∩_{p∈X} [|p|]_E of transactions in which the pattern X occurs

⇒ t_E(Y) ⊆ [|X|]_E implies t_E(X) ⊆ [|Y|]_E

• Let t_E(X) denote the set {t_E(p) | p ∈ X} of transaction id’s, where X is a pattern in D_E

• Let s_E(T) denote the pattern {s_E(d) | d ∈ T}

⇒ s_E[|X|]_E = ∩_{p∈X} L_E(p) = L_E(X)
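The notation above maps directly onto small helper functions. In this sketch (an illustration, not the authors' code) a transaction id t_E(p) is modelled as the tuple `("t", p)`; that encoding, the toy graph, and all function names are assumptions introduced here. The final check illustrates the identity s_E[|X|]_E = L_E(X).

```python
# Hypothetical toy neighborhoods L_E(p).
L_E = {
    "p1": {"p2", "p3"}, "p4": {"p2", "p3"}, "p5": {"p2", "p3"},
    "p2": {"p1", "p4", "p5"}, "p3": {"p1", "p4", "p5"},
}


def t(p):
    """t_E(p): id of the transaction generated by protein p."""
    return ("t", p)


def s(d):
    """s_E(d): the protein whose transaction has id d."""
    return d[1]


def occ_protein(p):
    """[|p|]_E: ids of the transactions in which protein p occurs."""
    return {t(q) for q in L_E if p in L_E[q]}


def occ(pattern):
    """[|X|]_E: ids of the transactions in which every protein of X occurs."""
    return set.intersection(*(occ_protein(p) for p in pattern))


def s_set(ids):
    """s_E(T): the proteins behind a set T of transaction ids."""
    return {s(d) for d in ids}


def common(pattern):
    """L_E(X): common neighborhood of a non-empty pattern X."""
    return set.intersection(*(L_E[p] for p in pattern))


X = {"p2", "p3"}
print(sorted(s_set(occ(X))), sorted(common(X)))
```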


Closed Patterns

• Let [X]_E = {Y | [|Y|]_E = [|X|]_E} denote the equivalence class of the pattern X in D_E

• A pattern X is said to be a closed pattern of D_E iff X = closed_E(X), where {closed_E(X)} = max [X]_E
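One standard way to compute closed_E(X) is to intersect all transactions that contain X: the result is the largest pattern occurring in exactly the same transactions. The sketch below uses that construction on a hypothetical toy graph; the function names are assumptions, and transactions are identified by their owning protein for brevity.

```python
# Hypothetical toy neighborhoods; the transaction of p is the set L_E(p).
L_E = {
    "p1": {"p2", "p3"}, "p4": {"p2", "p3"}, "p5": {"p2", "p3"},
    "p2": {"p1", "p4", "p5"}, "p3": {"p1", "p4", "p5"},
}


def occurrences(pattern):
    """[|X|]_E, represented by the proteins whose transaction contains X."""
    return frozenset(p for p, items in L_E.items() if set(pattern) <= items)


def closure(pattern):
    """closed_E(X): the largest pattern with the same occurrences as X."""
    owners = occurrences(pattern)
    if not owners:
        raise ValueError("not a pattern in D_E: it occurs in no transaction")
    return frozenset(set.intersection(*(L_E[p] for p in owners)))


def is_closed(pattern):
    """True iff X = closed_E(X)."""
    return closure(pattern) == frozenset(pattern)


print(sorted(closure({"p2"})), is_closed({"p2", "p3"}))
```

Here `{"p2"}` is not closed because p3 occurs in every transaction in which p2 occurs, so the closure adds p3.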


Key Proposition

• Proposition 3.2: Let X be a closed pattern in D_E. Then X = s_E[|s_E[|X|]_E|]_E

[Figure: diagram relating X, [|X|]_E, s_E[|X|]_E, and [|s_E[|X|]_E|]_E]


Proof


Consequently...

• Corollary 3.4: Let X and Y be closed patterns in D_E. Then X = Y iff s_E[|X|]_E = s_E[|Y|]_E

• Proposition 3.5: For any pattern X, we have X ∩ s_E[|X|]_E = {}


Implication

• Corollary 3.6: Let E be a PPI expt. Let C be the set of closed patterns of D_E. Then |C| is even


More Important Implication

• Theorem 3.7: The bijective function s_E ∘ [|·|]_E partitions the set of closed patterns of D_E into bipartite graphs

[Figure: bipartite graph over proteins p1, …, p5]


Even More Interesting...

• For each pair X and Y,
  – X is the largest group of proteins that bind all the proteins in Y; and
  – Y is the largest group of proteins that bind all the proteins in X

⇒ X, Y is a pair of interacting protein groups


A Couple More Propositions

• Proposition 3.3: For any pattern X, s_E[|X|]_E is a closed pattern in D_E

• Corollary 3.7: X is a closed pattern in D_E iff X = s_E[|s_E[|X|]_E|]_E
  – “the common neighbours of the common neighbours of X are X themselves.”
  – “each side of a protein interaction group is a closed pattern”
  – “X is one side of a protein interaction group”

• Recall that s_E[|X|]_E = L_E(X)


At Last!

• These are ALL the interacting protein groups

⇒ To mine these protein groups, it suffices to mine closed patterns in D_E


An Extension

• Not all interacting protein groups X, Y are equally interesting
  – X and Y are both singleton, vs
  – X is a large group, Y is a small group, vs
  – X is a large group, Y is a large group

⇒ Set “interestingness” threshold on X, Y st a pair of interacting protein groups X, Y is interesting only if |X| ≥ m and |Y| ≥ n


An Optimization

• Let X, Y be a pair of interacting protein groups
  – By Theorem 3.7, X = s_E[|Y|]_E and Y = s_E[|X|]_E
  – By definition of [|·|]_E, |X| = number of times Y occurs in D_E
  – By definition of [|·|]_E, |Y| = number of times X occurs in D_E

⇒ To mine interesting pairs X, Y of interacting protein groups in an expt E such that |X| ≥ m and |Y| ≥ n, it suffices to mine closed patterns X that appear ≥ n times in D_E and have |X| ≥ m
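The thresholded search can be illustrated with a deliberately naive miner. The sketch below is not CLOSET, FPclose or any other published algorithm: it enumerates every subset S of proteins and collects L_E(S), relying on the fact that L_E(S) is a closed pattern whenever it is non-empty. The toy graph and function names are assumptions, and the enumeration is exponential, so it is only suitable for tiny examples.

```python
from itertools import combinations

# Hypothetical toy neighborhoods L_E(p).
L_E = {
    "p1": {"p2", "p3"}, "p4": {"p2", "p3"}, "p5": {"p2", "p3"},
    "p2": {"p1", "p4", "p5"}, "p3": {"p1", "p4", "p5"},
}


def common(group):
    """L_E(P) for a non-empty group P, as a hashable frozenset."""
    return frozenset(set.intersection(*(L_E[p] for p in group)))


def closed_patterns():
    """Naive enumeration: collect every non-empty L_E(S)."""
    proteins = sorted(L_E)
    found = set()
    for k in range(1, len(proteins) + 1):
        for combo in combinations(proteins, k):
            pattern = common(combo)
            if pattern:
                found.add(pattern)
    return found


def interesting_pairs(m, n):
    """Pairs (X, Y) with Y = L_E(X), |X| >= m and |Y| >= n."""
    result = []
    for x in closed_patterns():
        y = common(x)  # |y| is the number of times x occurs in D_E
        if len(x) >= m and len(y) >= n:
            result.append((x, y))
    return result


print(len(closed_patterns()), interesting_pairs(2, 3))
```

On this toy graph there are two closed patterns, and only the orientation with the two-protein side as X survives the thresholds m = 2, n = 3.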


Closed Pattern Mining Algorithms

• CLOSET, Pei et al. 2000

• CARPENTER, Pan et al. 2003

• FPclose*, Grahne & Zhu 2003

• GC-growth, Li et al. 2005

• ...

⇒ We have efficient algorithms for mining interesting interacting protein groups


Example: Breitkreutz et al, Genome Biology, 4, R23, 2003

X and s_E[|X|]_E both occur with freq at least that of support threshold


It now remains to generate the motif pairs …


Many Motif Discovery Methods

• MEME, Bailey & Elkan 1995

• CONSENSUS, Hertz & Stormo 1995

• PROTOMAT, Henikoff & Henikoff 1991

• CLUSTAL, Higgins & Sharp 1988

• …

• For illustration, we use PROTOMAT here


PROTOMAT

• Core of Block Maker, a WWW server that returns blocks (ungapped multiple alignments) for any submitted set of protein sequences

• Comprises 2 steps:
  – MOTIF, Smith et al. 1990
    • Look for spaced triplets in given set of proteins
  – MOTOMAT, Henikoff & Henikoff 1991
    • Merge overlapping blocks produced by MOTIF
    • Extend blocks in both directions until similarity falls
    • Determine best set of blocks that are in the same order and do not overlap

• We treat every block, instead of the whole set of blocks generated by PROTOMAT, as a binding motif


Example, Breitkreutz et al, Genome Biology, 4, R23, 2003

• Comprises 19038 genetic and physical interactions in yeast among 4907 proteins

• Look for interesting pairs with m = n = 5

• About 1s to generate 60k closed patterns

⇒ Too many for PROTOMAT. So consider only maximal closed patterns, giving 7847 pairs

• PROTOMAT produces 17256 left blocks and 19350 right blocks after 6 hours

• Most groups yield 1 to 3 blocks

• Ave length of blocks = 11.696, std dev = 5.45


Databases Used for Validation

• BLOCKS, Pietrokovski et al. 1996

• PRINTS, Attwood & Beck 1994

• Pfam, Sonnhammer et al. 1997

• InterDom, Ng et al. 2003


Validation for Single Motifs

• Compare all single motifs in our discovered motif pairs with all domains of specific domain databases
  – LAMA, Pietrokovski 1996
  – transform blocks into position-specific scoring matrices (PSSM)
  – run Smith-Waterman to align pairs of PSSM, using Pearson correlation coefficient to measure similarity betw 2 columns
  – a block is mapped to another block if 95% of positions in a block occurring in the optimal alignment are common to another block and Z-score is > 5.6, where Z-score is the number of std dev away from the mean generated by millions of shuffles of the BLOCKS database

• Determine number of motifs that can be mapped to these domains and the overall correlation in the portions that are mapped
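Two of the ingredients named above, the Pearson correlation between two PSSM columns and the Z-score of an alignment score against a background, are easy to sketch. This is not LAMA's implementation: the function names are hypothetical, the columns are plain lists of per-residue scores, and the use of a population standard deviation is an assumption of this sketch.

```python
import math


def pearson(col_a, col_b):
    """Pearson correlation between two equally long PSSM columns."""
    n = len(col_a)
    mean_a = sum(col_a) / n
    mean_b = sum(col_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(col_a, col_b))
    var_a = sum((a - mean_a) ** 2 for a in col_a)
    var_b = sum((b - mean_b) ** 2 for b in col_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / math.sqrt(var_a * var_b)


def z_score(score, background_scores):
    """Number of std dev by which score exceeds the background mean."""
    n = len(background_scores)
    mean = sum(background_scores) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in background_scores) / n)
    return (score - mean) / std


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(z_score(10, [2, 4, 4, 4, 5, 5, 7, 9]))
```

In the procedure described above, the background would come from alignment scores on shuffled blocks, and a mapping would be accepted only when the Z-score exceeds 5.6.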


Results for Single Motifs

• Our blocks map to 32% of blocks in BLOCKS and PRINTS, yet motifs from our blocks cover 72% of domains in BLOCKS and PRINTS

⇒ Maybe most domains in BLOCKS and PRINTS have less than half a block as binding motifs, or may not be related to binding behaviour


Validation for Motif Pairs

• Map our motif pairs into domain-domain interacting pairs

• Determine the number of overlaps between our motif pairs and those in the domain-domain interaction database

• Use InterDom as the domain-domain interaction database (30037 interactions among 3535 domains)


Linking Our Motif Pairs to InterDom

• InterDom represents domains by Pfam entries

⇒ To x-link, we have to
  – Map our motifs to blocks in BLOCKS and PRINTS
  – Link from BLOCKS and PRINTS to InterPro
  – Link from InterPro to Pfam
  – Match Pfam to InterDom


Results for Motif Pairs

[Figure: results chart with the following labels]

• Domain-domain interactions inferred from protein complexes or from interactions between single domain proteins

• Both sides mapped to BLOCKS

• Both sides mapped to PRINTS

• One side mapped to PRINTS, one side mapped to BLOCKS


Example Confirmed Binding Motif

• 1 of the 241 binding motifs we found that can be confirmed using protein complexes is #1781...

As shown in the next slide, this pair corresponds to interaction sites between LSM domains. E.g., all 7 pairs of adjacent LSM domains of pdb1mgq exhibit it.


Example: LSM Domains of pdb1mgq


Sequence Identity Within a Group

• Ave seq identity within a group is 7.5%

⇒ Group is unlikely to be detected by standard methods based on seq homology


Conclusions

• Connection between maximal complete bipartite subgraphs and closed patterns

⇒ Closed pattern mining algorithms can be used to enumerate maximal complete bipartite subgraphs efficiently

• Connection between pairs of interacting protein groups and closed patterns

⇒ Discovery of binding motifs is accelerated because we need not execute expensive motif discovery algorithms on insignificant groups


References

• Haiquan Li, Jinyan Li, Limsoon Wong. Discovering Motif Pairs at Interaction Sites from Protein Sequences on a Proteome-Wide Scale. Bioinformatics, 22(8):989-996, 2006

• Jinyan Li, Haiquan Li, Donny Soh, Limsoon Wong. A Correspondence Between Maximal Complete Bipartite Subgraphs and Closed Patterns. Proceedings of 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 146-156, Porto, Portugal, October 2005

• Smith et al., Finding Sequence Motifs in Groups of Functionally Related Proteins. PNAS, 87:826-830, 1990

• Henikoff & Henikoff. Automated assembly of protein blocks for database searching. NAR, 19:6565-6572, 1991


Advancing Knowledge Discovery

10 min


Some of those “techniques” frequently needed in analysis of biomedical data are insufficiently studied by current data mining researchers

• Recognizing what samples are relevant and what are not

• Recognizing what features are relevant and what are not & handling missing or incorrect values

• Recognizing trends, changes, and their causes


Going Beyond Frequent Patterns: Marrying Statisticians and Data Miners

• Statisticians use a battery of “interestingness” measures to decide if a feature/factor is relevant

• Examples:
  – Odds ratio
  – Relative risk
  – Gini index
  – Yule’s Q & Y
  – etc

• Odds ratio

OR(P, D) = \frac{P_{D,ed} / P_{D,\neg e d}}{P_{D,e \neg d} / P_{D,\neg e \neg d}}
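As a worked illustration of two of the measures listed above, the sketch below computes the textbook odds ratio and relative risk from the four cells of a 2×2 table. The argument names and the counts are assumptions chosen for illustration; the slide does not spell out its own notation, so this follows the standard epidemiological definitions rather than the talk's symbols.

```python
def odds_ratio(exposed_cases, exposed_controls,
               unexposed_cases, unexposed_controls):
    """Odds of disease among the exposed divided by odds among the unexposed."""
    return (exposed_cases * unexposed_controls) / (
        exposed_controls * unexposed_cases)


def relative_risk(exposed_cases, exposed_controls,
                  unexposed_cases, unexposed_controls):
    """Risk of disease among the exposed divided by risk among the unexposed."""
    risk_exposed = exposed_cases / (exposed_cases + exposed_controls)
    risk_unexposed = unexposed_cases / (unexposed_cases + unexposed_controls)
    return risk_exposed / risk_unexposed


# Hypothetical counts: 20 of 100 exposed and 10 of 100 unexposed are cases.
print(odds_ratio(20, 80, 10, 90))     # (20 * 90) / (80 * 10)
print(relative_risk(20, 80, 10, 90))  # (20 / 100) / (10 / 100)
```

In a pattern-mining setting, "exposed" would mean that a sample contains the pattern under test, so each candidate pattern yields its own 2×2 table.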


Going Beyond Frequent Patterns: Experimenting With Tipping Events

• Given a data set, such as those related to human health, it is interesting to determine impt cohorts and impt factors causing transition betw cohorts

⇒ Tipping events

⇒ Tipping factors are “action items” for causing transitions

• “Tipping event” is two or more population cohorts that are significantly different from each other

• “Tipping factors” (TF) are small patterns whose presence or absence causes significant difference in population cohorts

• “Tipping base” (TB) is the pattern shared by the cohorts in a tipping event

• “Tipping point” (TP) is the combination of TB and a TF


Any Question?


Acknowledgements

• Disease Treatment Planning & Understanding
  – Jinyan Li, Huiqing Liu
  – Allen Yeoh, Richie Soong, Benjamin Mou, Anthony Tung

• Gene Feature Recognition
  – Huiqing Liu, Rajesh Chowdhary, Vladimir Bajic
  – Fanfan Zeng, Roland Yap, Liang Huai Yang, Yun Zheng, Mong Li Lee, Wynne Hsu

• Protein Function Inference & Binding Motifs
  – Jinyan Li, Haiquan Li, See-Kiong Ng, Chris Tan, Donny Soh
  – Hon Nian Chua, Wing-Kin Sung, Jing Chen

• Text Mining from Biomedical Literature
  – See-Kiong Ng, Jian Su, Guodong Zhou
  – Chew Lim Tan

• Data Mining Theory & Technologies
  – Jinyan Li, Haiquan Li, Guimei Liu
  – Meng Ling Feng, Guozhu Dong