
An Introduction to Knowledge Discovery and Data Mining

TuBao Ho, School of Knowledge Science, Japan Advanced Institute of Science and Technology

PDCAT 2002, T.B. Ho 2

Outline

Basic concepts of KDD

KDD techniques: classification, association, clustering, visualization

Challenges and trends in KDD

KDD and high performance computing

Case studies in medicine data mining

Data, Information, Knowledge

Data: un-interpreted signals (1st, 2nd, 3rd, 4th, …: 25, 27, 21, 26, …)

Information: data equipped with meaning (temperature of the days)

Knowledge: integrated information, including facts and their relations ("justified true belief") (E = mc^2)

Data mining metaphor: extracting ore from rock

Data, Information, Knowledge

Data: (income, debt) pairs in US$ K

1. (5.6, 8.5)  2. (6.0, 13.0)  3. (11.0, 12.0)  4. (11.0, 19.0)  5. (13.5, 10.0)
6. (16.5, 20.0)  7. (17.5, 15.0)  8. (17.5, 5.0)  9. (22.5, 25.0)  10. (26.0, 7.5)
11. (30.0, 9.0)  12. (30.0, 18.0)  13. (30.0, 30.0)  14. (31.0, 14.0)  15. (32.5, 25.0)
16. (38.0, 12.0)  17. (41.0, 9.0)  18. (41.0, 22.0)  19. (43.5, 12.5)  20. (44.0, 27.5)
21. (45.0, 22.5)  22. (48.0, 28.0)  23. (52.5, 21.0)  24. (53.5, 32.0)  25. (54.0, 27.5)
26. (57.5, 18.0)  27. (59.0, 18.0)  28. (62.5, 32.5)  29. (63.0, 18.0)

Information: Mean of Debt = 18.4, Mean of Income = 34.5

Knowledge: "if income < $33K, then the person has defaulted on the loan"

[Figure: scatter plot of Debt versus Income marking the mean point (34.5, 18.4) and the income threshold 33; the two groups are labeled "Have defaulted on the loan" and "Good status with the bank"]
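The step from data to information to knowledge on this slide can be sketched in a few lines of Python. This is only an illustration: the names `points` and `has_defaulted` are hypothetical, and the 33K threshold is taken from the slide's rule rather than learned from the data. The computed means should land within about 0.1 of the rounded values quoted on the slide.

```python
from statistics import mean

# Data: (income, debt) pairs in US$ K, copied from the slide.
points = [
    (5.6, 8.5), (6.0, 13.0), (11.0, 12.0), (11.0, 19.0), (13.5, 10.0),
    (16.5, 20.0), (17.5, 15.0), (17.5, 5.0), (22.5, 25.0), (26.0, 7.5),
    (30.0, 9.0), (30.0, 18.0), (30.0, 30.0), (31.0, 14.0), (32.5, 25.0),
    (38.0, 12.0), (41.0, 9.0), (41.0, 22.0), (43.5, 12.5), (44.0, 27.5),
    (45.0, 22.5), (48.0, 28.0), (52.5, 21.0), (53.5, 32.0), (54.0, 27.5),
    (57.5, 18.0), (59.0, 18.0), (62.5, 32.5), (63.0, 18.0),
]

# Information: summary statistics computed from the data.
mean_income = mean(income for income, _ in points)
mean_debt = mean(debt for _, debt in points)


def has_defaulted(income_k, threshold_k=33.0):
    """Knowledge: the slide's rule, income below the threshold implies default."""
    return income_k < threshold_k


# Customers that the rule flags as having defaulted.
flagged = [p for p in points if has_defaulted(p[0])]
```

The rule is applied here as given; discovering the threshold itself is the data mining task.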

Knowledge Discovery and Data Mining (KDD)

KDD: the automatic extraction of non-obvious, hidden knowledge (patterns/models) from large volumes of data

10^6-10^12 bytes: we never see the whole data set or put it in the memory of computers

What knowledge? How to represent and use it?

Data mining algorithms?

...10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS, VIRUS

12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71, 59, F,-,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA

15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, -,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA

16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ?, negative, ?, n, n, ABSCESS, VIRUS...

IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15 THEN Prediction = VIRUS [confidence = 87.5%]

From Data to Knowledge

Meningitis data, Tokyo Med. & Dental Univ., 38 attributes

[Legend: numerical, categorical, missing, and class attributes]

KDD: An Interdisciplinary Field

Databases: store, access, search, update data (deduction)

Statistics: infer information from data (deduction and induction, mainly numeric data)

Machine Learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)

Also Algorithmics, Visualization, Data warehouses, OLAP, etc.

KDD: New and Fast Growing Area

KDD'95, 96, 97, 98, 99, 00, 01, 02 (ACM, America)

PAKDD'97, 98, 99, 00, 01, 02 (Pacific Rim & Asia)

PKDD'97, 98, 99, 00, 01, 02 (Europe)

ICDM'01, 02 (IEEE), SDM'01, 02 (SIAM)

Industrial Interest: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, …

Japan: the FGCS Project focused on logic programming and reasoning; attention has been paid to knowledge acquisition and machine learning. 2001-2004: "Active Mining Project"

Why KDD?

High-powered computers (larger disks, faster CPUs) and networked data have become widely available

People gather and store so much data because they think some valuable assets are implicitly coded within it. Its true value depends on the ability to extract useful information

Manual data analysis is impractical

How to acquire knowledge for knowledge-based systems remains the main difficult and crucial AI problem

Relational Databases

A relational database is a collection of tables, each of which is assigned a unique name and consists of a set of attributes and a set of tuples.

customer

Cust-ID | name | address | age | income | credit-info | …
C1 | Smith, Sandy | 5463 E Hasting, Burnaby, BC V5A 459, Canada | 21 | $27000 | 1 | …
… | … | … | … | … | … | …

item

Item-ID | name | brand | category | type | price | place-made | supplier | cost
I3 | high-res-TV | Toshiba | high resolution | TV | $988.00 | Japan | NIkoX | $600.00
I8 | multidisc-CDplayer | Sanyo | multidisc | CD player | $369.00 | Japan | MusicFont | $120.00
… | … | … | … | … | … | … | … | …

employee

Emp-ID | name | category | group | salary | commission
E35 | Jones, Jane | home entertainment | manager | $18,000 | 2%
… | … | … | … | … | …

branch

Branch-ID | name | address
B1 | City square | 369 Cambie St., Vancouver, BC V5L 3A2, Canada
… | … | …

purchases

Trans-ID | cust-ID | empl-ID | date | time | method-paid | amount
T100 | C1 | B55 | 01/21/98 | 15:45 | Visa | $1357.00
… | … | … | … | … | … | …

item-sold

Trans-ID | item-ID | qty
T100 | I3 | 1
T100 | I8 | 2
… | … | …

works-at

Empl-ID | branch-ID
E55 | B1
… | …

Data Warehouses

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.

[Figure: data sources in Chicago, New York, Vancouver, and Toronto are cleaned, transformed, integrated, and loaded into the data warehouse, which clients access through query and analysis tools]

Transactional Databases

A transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction

Trans_ID | list of item_IDs
T100 | I1, I3, I8, I16
T200 | I3, I5, I23
… | …

Advanced Database Systems

Object-Oriented Databases

Object-Relational Databases

Spatial Databases

Temporal Databases and Time-Series Databases

Text Databases and Multimedia Databases

Heterogeneous Databases and Legacy Databases

The World Wide Web

Spatial Databases

Spatial databases contain spatial-related information: geographic databases, VLSI chip design databases, medical and satellite image databases, etc.

Data mining may uncover patterns describing the characteristics of houses located near a specified kind of location, the climate of mountainous areas located at various altitudes, etc.

[Figure: Japanese earthquakes, 1961-1994]

Temporal and Time-Series Databases

They store time-related data. A temporal database stores relational data that include time-related attributes (timestamps with different semantics). A time-series database stores sequences of values that change with time (stock exchange)

Data mining finds the characteristics of object evolution, trend of change for objects: e.g., stock exchange data can be mined to uncover trends in investment strategies

Text and Multimedia Databases

Text databases contain documents, usually highly unstructured or semi-structured. Data mining can uncover general descriptions of object classes, keywords, content associations, clustering behavior of text objects, etc.

Multimedia databases store image, audio, and video data: picture content-based retrieval, voice-email systems, video-on-demand-systems, speech-based user interface, etc.

The World Wide Web

The Web provides an enormous source of explicit and implicit knowledge that people can navigate and search for what they need.

Example: When examining the data collected from Internet Mart, heavily trodden paths gave BT hints about regions of the site that were of key interest to its visitors.

The KDD Process

KDD is inherently interactive and iterative

1. Understand the domain and define problems

2. Collect and preprocess data (maybe 70-90% of effort and cost in KDD)

3. Data mining: extract patterns/models (a step in the KDD process consisting of methods that produce useful patterns or models from the data)

4. Interpret and evaluate discovered knowledge

5. Put the results to practical use

The KDD Process

[Figure: detailed flowchart of the five KDD steps. Box labels: data organized by function; data warehousing; create/select target database; select sampling technique and sample data; supply missing values; eliminate noisy data; normalize values; transform values; create derived attributes; find important attributes & value ranges; transform to different representation; select DM task(s); select DM method(s); extract knowledge; test knowledge; refine knowledge; query & report generation; aggregation & sequences; advanced methods]

Starting Points: Data or Mining?

Nature of Data

Flat data tables
Relational database
Temporal & Spatial
Transaction
Multimedia data
Text
Web

Mining tasks and methods

Classification/Prediction: decision trees, neural networks, rule induction, etc.

Description: association analysis, clustering, etc.

Outline

Basic concepts of KDD

KDD techniques: classification, association, clustering, visualization

Challenges and trends in KDD

KDD and high performance computing

Case studies in medicine data mining

Primary task of KDD

Predictive mining tasks perform inference on the current data in order to make predictions

Descriptive mining tasks characterize the general properties of the data in the database

Discovery of Patterns and/or Models

Models

A model is a global description of a data set, a high-level population or large-sample perspective

A model tells us about correlation between variables (regression), about hierarchies of clusters (clustering), a neural network, etc.

Patterns

A pattern is a low-level summary of a relationship, perhaps one that holds only for a few records or for only a few variables (local)

A pattern is seen as a statement S in a language L that describes a subset D(S) of a database D with a quality q(S)

IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15 THEN Prediction = VIRUS [87.5%]

Datasets: Cancerous and Healthy Cells

ID | color | #nuclei | #tails | class
H1 | light | 1 | 1 | healthy
H2 | dark | 1 | 1 | healthy
H3 | light | 1 | 2 | healthy
H4 | light | 2 | 1 | healthy
C1 | dark | 1 | 2 | cancerous
C2 | dark | 2 | 1 | cancerous
C3 | light | 2 | 2 | cancerous
C4 | dark | 2 | 2 | cancerous

[Figure: drawings of the eight cells H1-H4 and C1-C4]

Classification/Prediction

Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown

Decision trees
IF-THEN rules
Neural networks
Mathematical formulae
etc.

Classification: A Two-Step Process

Model construction: classification algorithms learn a classifier (model) from the training data (H1, H2, H3, H4, C1, C2), e.g.

If color = dark and # tails = 2 Then cancerous cell

Model usage: the classifier is applied to an unknown case (Cancerous?)

Comparing Classification Methods

Predictive accuracy: the ability of the classifier to correctly predict unseen data

Speed: refers to computation cost

Robustness: the ability of the classifier to make correct predictions given noisy data or data with missing values

Scalability: the ability to construct the classifier efficiently given large amounts of data

Interpretability: the level of understanding and insight that is provided by the classifier

Mining with Decision Trees

#nuclei?
    1: color?
        light: H
        dark: #tails?
            1: H
            2: C
    2: color?
        light: #tails?
            1: H
            2: C
        dark: C

(H = healthy, C = cancerous; tree built from the cells H1-H4 and C1-C4)

General Algorithm for Tree Induction

1. Choose the “best” attribute by a given measure for attribute selection

2. Extend tree by adding new branch for each value of the attribute

3. Sort training examples to leaf nodes

4. If examples in a node belong to one class Then Stop Else Repeat steps 1-4 for leaf nodes

5. Prune the tree to avoid over-fitting

Two steps: recursively generate the tree (1-4), and prune the tree (5)
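Steps 1-4 above can be sketched as a short recursive procedure. This is a minimal illustration on the cell dataset from the earlier slide, under two assumptions that the slides do not state: the attribute selection measure is information gain (ID3-style), and pruning (step 5) is left out. The names `CELLS`, `build_tree`, and `classify` are hypothetical.

```python
from collections import Counter
from math import log2

ATTRS = ["color", "nuclei", "tails"]
CELLS = [
    ({"color": "light", "nuclei": 1, "tails": 1}, "healthy"),    # H1
    ({"color": "dark", "nuclei": 1, "tails": 1}, "healthy"),     # H2
    ({"color": "light", "nuclei": 1, "tails": 2}, "healthy"),    # H3
    ({"color": "light", "nuclei": 2, "tails": 1}, "healthy"),    # H4
    ({"color": "dark", "nuclei": 1, "tails": 2}, "cancerous"),   # C1
    ({"color": "dark", "nuclei": 2, "tails": 1}, "cancerous"),   # C2
    ({"color": "light", "nuclei": 2, "tails": 2}, "cancerous"),  # C3
    ({"color": "dark", "nuclei": 2, "tails": 2}, "cancerous"),   # C4
]


def entropy(examples):
    """Class entropy (in bits) of a list of (attributes, label) pairs."""
    counts = Counter(label for _, label in examples)
    total = len(examples)
    return -sum(n / total * log2(n / total) for n in counts.values())


def information_gain(examples, attr):
    """Assumed selection measure: entropy reduction from splitting on attr."""
    total = len(examples)
    remainder = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(examples) - remainder


def build_tree(examples, attrs):
    labels = {label for _, label in examples}
    if len(labels) == 1:                      # step 4: node is pure, stop
        return labels.pop()
    if not attrs:                             # nothing left to split on: majority class
        return Counter(y for _, y in examples).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(examples, a))   # step 1
    rest = [a for a in attrs if a != best]
    branches = {}
    for value in sorted({x[best] for x, _ in examples}):             # step 2
        subset = [(x, y) for x, y in examples if x[best] == value]   # step 3
        branches[value] = build_tree(subset, rest)                   # repeat 1-4
    return (best, branches)


def classify(tree, x):
    """Follow branches until a leaf (a class label string) is reached."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[x[attr]]
    return tree


tree = build_tree(CELLS, ATTRS)
```

On this dataset the three attributes have equal information gain at the root, so the attribute chosen first depends on tie-breaking; the resulting tree need not match the one drawn on the previous slide, but it classifies the training cells consistently.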

Measures for Attribute Selection

Other Classification Methods

Neural Networks
Instance-based Classification
Genetic Algorithms
Rough Set Approach
Statistical Approaches
Support Vector Machines
etc.

Mining with Neural Networks

[Figure: a neural network with inputs color = dark, # nuclei = 1, # tails = 2 and outputs Healthy and Cancerous, shown with the cells H1-H4 and C1-C4]

Neural Networks

Advantages

prediction accuracy is generally high
robust, works when training examples contain errors
output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
fast evaluation of the learned target function

Criticism

long training time
difficult to understand the learned function (weights)
not easy to incorporate domain knowledge

Instance-based Classification

Instance-based classification: uses the most similar individual instances known from the past to classify a new instance

Typical approaches

k-nearest neighbor approach: instances are represented as points in a Euclidean space

Locally weighted regression: constructs a local approximation

Case-based reasoning: uses symbolic representations and knowledge-based inference
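The k-nearest neighbor approach can be illustrated on the cell dataset. The sketch below assumes a hypothetical numeric encoding (light = 0, dark = 1, followed by the numbers of nuclei and tails) so that each cell becomes a point in Euclidean space; the encoding and the function name `knn_predict` are illustrative choices, not part of the tutorial.

```python
from collections import Counter
from math import dist

# (color, #nuclei, #tails) with the assumed encoding light = 0, dark = 1.
TRAIN = [
    ((0, 1, 1), "healthy"),    # H1
    ((1, 1, 1), "healthy"),    # H2
    ((0, 1, 2), "healthy"),    # H3
    ((0, 2, 1), "healthy"),    # H4
    ((1, 1, 2), "cancerous"),  # C1
    ((1, 2, 1), "cancerous"),  # C2
    ((0, 2, 2), "cancerous"),  # C3
    ((1, 2, 2), "cancerous"),  # C4
]


def knn_predict(query, train, k=3):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(train, key=lambda item: dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


prediction = knn_predict((1, 2, 2), TRAIN, k=3)
```

No model is built in advance: all the work happens at query time, which is what distinguishes instance-based methods from the tree and rule learners above.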

Genetic Algorithms (GA)

GA: based on an analogy to biological evolution

Each rule is represented by a string of bits, e.g., IF A1 and Not A2 then C2 can be encoded as 100

An initial population is created consisting of randomly generated rules

Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring

The fitness of a rule is represented by its classification accuracy on a set of training examples

Offspring are generated by crossover and mutation
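The two genetic operators can be sketched on bit-string rules such as the 100 example above. This is only an illustration: the crossover point and the mutated position are passed in explicitly so that the behavior is reproducible (a real GA would choose them at random), and the fitness function in the usage lines is a hypothetical stand-in for classification accuracy.

```python
def crossover(parent_a, parent_b, point):
    """Single-point crossover: swap the tails of two bit strings at `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(rule, index):
    """Flip the bit at `index` in a rule encoded as a string of 0s and 1s."""
    flipped = "1" if rule[index] == "0" else "0"
    return rule[:index] + flipped + rule[index + 1:]


def select_fittest(population, fitness, keep):
    """Survival of the fittest: keep the `keep` highest-scoring rules."""
    return sorted(population, key=fitness, reverse=True)[:keep]


# Usage with a placeholder fitness (number of 1 bits), NOT classification accuracy.
children = crossover("1100", "0011", 2)
mutant = mutate("100", 1)
survivors = select_fittest(["100", "110", "111"], lambda r: r.count("1"), 2)
```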

Rough Set Approach

Rough sets are used to approximately or "roughly" define equivalence classes

A rough set for a given class C is approximated by two sets:

A lower approximation (certain to be in C)

An upper approximation (possible to be in C)

Rough sets are also used for finding the minimal subsets (reducts) of attributes, dependencies in data, rules, etc.

[Figure: a set X drawn over a grid of equivalence classes]

Rough Sets and Data Mining, T.Y. Lin, N. Cercone (eds.), Kluwer Academic Pub., 1997.

Rough Sets in Knowledge Discovery, L. Polkowski, A. Skowron (eds.), Physica-Verlag, 1998.
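The lower and upper approximations can be computed directly from equivalence classes. The sketch below reuses the cell dataset and, as an illustrative assumption, groups cells that agree on color and #tails only; the target class C is the set of cancerous cells. The function names are hypothetical.

```python
from collections import defaultdict

# Cell dataset from the earlier slide: (color, #nuclei, #tails).
CELLS = {
    "H1": ("light", 1, 1), "H2": ("dark", 1, 1),
    "H3": ("light", 1, 2), "H4": ("light", 2, 1),
    "C1": ("dark", 1, 2), "C2": ("dark", 2, 1),
    "C3": ("light", 2, 2), "C4": ("dark", 2, 2),
}
ATTR_INDEX = {"color": 0, "nuclei": 1, "tails": 2}


def equivalence_classes(objects, attrs):
    """Group objects that are indistinguishable on the chosen attributes."""
    classes = defaultdict(set)
    for name, values in objects.items():
        key = tuple(values[ATTR_INDEX[a]] for a in attrs)
        classes[key].add(name)
    return list(classes.values())


def approximations(objects, attrs, target):
    """Return (lower, upper) approximations of the target set."""
    lower, upper = set(), set()
    for block in equivalence_classes(objects, attrs):
        if block <= target:      # block lies entirely inside the target: certain
            lower |= block
        if block & target:       # block overlaps the target: possible
            upper |= block
    return lower, upper


cancerous = {"C1", "C2", "C3", "C4"}
lower, upper = approximations(CELLS, ["color", "tails"], cancerous)
```

With only color and #tails, dark two-tailed cells are certainly cancerous, while light one-tailed cells are certainly not; the remaining cells fall in the boundary between the two approximations.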

Bayesian Classification

Calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of problems

P(Ci|X) = probability that the instance X = <x1,…,xk> is of class Ci. Idea: assign to sample X the class label Ci such that P(Ci|X) is maximal

Bayes' theorem:

$$P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}$$

Naïve assumption: attribute independence

A Bayesian belief network allows conditional independence to be defined among subsets of the variables
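Under the naïve independence assumption, P(X|Ci) factors into a product of per-attribute probabilities, and P(X) can be ignored because it is the same for every class. The sketch below applies this to the cell dataset using plain relative frequencies; it is a minimal illustration with hypothetical function names, and it omits smoothing, so an attribute value never seen with a class drives that class's score to zero.

```python
from collections import Counter, defaultdict

CELLS = [
    (("light", 1, 1), "healthy"), (("dark", 1, 1), "healthy"),
    (("light", 1, 2), "healthy"), (("light", 2, 1), "healthy"),
    (("dark", 1, 2), "cancerous"), (("dark", 2, 1), "cancerous"),
    (("light", 2, 2), "cancerous"), (("dark", 2, 2), "cancerous"),
]


def train(examples):
    """Count classes and, per (class, attribute position), attribute values."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)
    for x, label in examples:
        for i, value in enumerate(x):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts


def scores(x, model):
    """Unnormalized posteriors P(X|Ci) * P(Ci) for every class Ci."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    result = {}
    for label, n in class_counts.items():
        score = n / total                                # prior P(Ci)
        for i, value in enumerate(x):
            score *= value_counts[(label, i)][value] / n  # P(x_i | Ci)
        result[label] = score
    return result


def predict(x, model):
    """Assign the class whose unnormalized posterior is maximal."""
    s = scores(x, model)
    return max(s, key=s.get)


model = train(CELLS)
label = predict(("dark", 2, 2), model)
```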

Market Basket Analysis

Analyzes customer buying habits by finding associations between the different items that customers place in their "shopping baskets"

Helps develop marketing strategies by gaining insight into which items are frequently purchased together by customers

How often do people buy onigiri and beer together?

Mining with Association Rules

Association: the presence of the same color and # nuclei implies the presence of the same # tails in the same record

If color = light and # nuclei = 1 Then # tails = 1 (support = 12.5%; confidence = 50%)

If # nuclei = 2 and cell = cancerous Then # tails = 2 (support = 25%; confidence = 66.7%, since C2, C3, and C4 match the condition but C2 has one tail)

Support: the proportion of times that the rule applies

Confidence: the proportion of times that the rule is correct

Apriori algorithm, R. Agrawal 1993

[Figure: drawings of the eight cells H1-H4 and C1-C4]

Rule Measures: Support and Confidence

Example: Find all the rules X & Y ⇒ Z with minimum confidence and support

support s = probability that a transaction contains {X and Y and Z}

confidence c = conditional probability that a transaction having {X and Y} also contains Z

[Figure: Venn diagram of "Customer buys onigiri" and "Customer buys beer", with the overlap labeled "Customer buys both"]

Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

If minimum support is 50% and minimum confidence is 50%:

A ⇒ C (s = 50%, c = 66.6%)

C ⇒ A (s = 50%, c = 100%)
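The two measures can be checked on the four transactions from this slide. The sketch below is a direct transcription of the definitions; the function names `support` and `confidence` are illustrative.

```python
# Transactions from the slide, keyed by transaction ID.
TRANSACTIONS = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Conditional probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


s_ac = support({"A", "C"}, TRANSACTIONS)
c_a_to_c = confidence({"A"}, {"C"}, TRANSACTIONS)
c_c_to_a = confidence({"C"}, {"A"}, TRANSACTIONS)
```

Support is symmetric in the rule's two sides, while confidence is not, which is why A ⇒ C and C ⇒ A share a support but differ in confidence.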

Association Mining: Apriori Algorithm

It is composed of two steps:

1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a pre-determined minimum support count

2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence

Association Mining: Apriori Principle

Min. support 50%, min. confidence 50%

Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

Frequent Itemset | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A,C} | 50%

For rule A ⇒ C:

support = support({A and C}) = 50%

confidence = support({A and C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent

The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation

1. Find the frequent itemsets: the sets of items that have minimum support

A subset of a frequent itemset must also be a frequent itemset, i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets

Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)

C_1 → … → L_{i-1} → C_i → L_i → C_{i+1} → … → L_k

2. Use the frequent itemsets to generate association rules.

Example (min_sup_count = 2)

Transactional data D

TID | List of item_IDs
T100 | I1, I2, I5
T200 | I2, I4
T300 | I2, I3
T400 | I1, I2, I4
T500 | I1, I3
T600 | I2, I3
T700 | I1, I3
T800 | I1, I2, I3, I5
T900 | I1, I2, I3

Scan D for the count of each candidate to obtain C1

Itemset | Sup. count
{I1} | 6
{I2} | 7
{I3} | 6
{I4} | 2
{I5} | 2

Compare each candidate support count with the minimum support count to obtain L1

Itemset | Sup. count
{I1} | 6
{I2} | 7
{I3} | 6
{I4} | 2
{I5} | 2

Example (min_sup_count = 2)

Generate C2 candidates from L1

C2: {I1, I2}, {I1, I3}, {I1, I4}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}, {I3, I4}, {I3, I5}, {I4, I5}

Scan D for the count of each candidate

C2 itemset | Sup. count
{I1, I2} | 4
{I1, I3} | 4
{I1, I4} | 1
{I1, I5} | 2
{I2, I3} | 4
{I2, I4} | 2
{I2, I5} | 2
{I3, I4} | 0
{I3, I5} | 1
{I4, I5} | 0

Compare each candidate support count with the minimum support count

L2 itemset | Sup. count
{I1, I2} | 4
{I1, I3} | 4
{I1, I5} | 2
{I2, I3} | 4
{I2, I4} | 2
{I2, I5} | 2

Generate C3 candidates from L2

C3: {I1, I2, I3}, {I1, I2, I5}

Scan D for the count of each candidate

C3 itemset | Sup. count
{I1, I2, I3} | 2
{I1, I2, I5} | 2

Compare each candidate support count with the minimum support count

L3 itemset | Sup. count
{I1, I2, I3} | 2
{I1, I2, I5} | 2
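The level-wise search in this example can be reproduced with a compact sketch of the frequent-itemset step. It is an illustration rather than an optimized implementation: candidates are formed by joining pairs of frequent (k-1)-itemsets, pruned with the Apriori principle, and counted by a full scan of the transactions. The function name `apriori` and its return format (a dictionary from itemset to support count) are assumptions made for this example.

```python
from itertools import combinations

# The nine transactions T100..T900 from the example.
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]


def apriori(transactions, min_count):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    items = {item for t in transactions for item in t}
    candidates = {frozenset([item]) for item in items}   # C1
    frequent = {}
    k = 1
    while candidates:
        # Scan the data for the count of each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Keep candidates that reach the minimum support count (L_k).
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


frequent = apriori(D, min_count=2)
```

The loop ends after the third level because the only possible 4-itemset, {I1, I2, I3, I5}, contains the infrequent subset {I1, I3, I5} and is pruned before any counting.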

Mining with Clustering

Clustering analyzes data objects without consulting a known class label.

The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity

Partition-based clustering is suited to large sets of numerical data.

Hierarchical clustering, with at least O(n^2) time complexity, seems unsuitable for very large datasets

What is Cluster Analysis?

A cluster is a collection of data objects satisfying:

Objects in this cluster are similar to one another

Objects in this cluster are dissimilar to the objects in other clusters

The process of grouping objects into clusters is called clustering

Clustering in Different Fields

Statistics: for many years, the focus has been on distance-based clustering (S-Plus, SPSS, SAS)

Machine learning: unsupervised learning. In conceptual clustering, a group of objects forms a class only if it is described by a concept

KDD: Efficient and effective clustering of large databases: scalability, complex shapes and types of data, high dimensional clustering, mixed numerical and categorical data

What is Good Clustering?

A good clustering method will produce high quality clusters with

high intra-class similarity (within a class)

low inter-class similarity (between classes)

The quality of clustering basically depends on the similarity measure and the cluster representative used by the method

Typical Requirements of Clustering

Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noisy data
Insensitivity to the order of input records
High dimensionality
Constraint-based clustering
Interpretability and usability

Clustering Methods in KDD

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Grid-Based Methods

Model-Based Methods

Partitioning Methods

Given n objects and k, the number of clusters to form, a partitioning algorithm organizes the objects into a partition of k clusters

The clusters are formed to optimize an objective partitioning criterion so that the objects within a cluster are "similar", whereas the objects of different clusters are "dissimilar"

K-means Algorithm (K=2)

1. Two centers are selected randomly from the n objects

2. Form two clusters by assigning each object to its nearest center

3. Calculate two new centers

4. Re-form two new clusters and calculate two new centers

Repeat steps 2 and 3 until the stopping conditions hold
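The assign-and-update loop can be sketched in plain Python. One deliberate departure from the slide: the initial centers are passed in as an argument instead of being drawn at random, so that the run is reproducible; the stopping condition assumed here is that no object changes cluster. The function name `kmeans` and the sample points are illustrative.

```python
from math import dist


def kmeans(points, centers, max_iter=100):
    """Return (labels, centers) after alternating assignment and update steps."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:   # assumed stopping condition: no change
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:            # an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centers = kmeans(points, centers=[(1, 1), (8, 8)])
```

Different starting centers can lead to different final clusters, which is one reason the algorithm is often run several times in practice.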

Partitioning Methods

The k-means algorithm is sensitive to outliers

The k-medoids method uses a medoid (the most centrally located object in a cluster) as the cluster representative

The EM (Expectation Maximization) algorithm: assigns each object to a cluster according to a weight representing the probability of membership.

PAM (Partitioning Around Medoids)

From k-Medoids to CLARA (Clustering LARge Applications)

From CLARA to CLARANS (Clustering LARge Applications based on RANdomized Search)

Hierarchical Methods

A hierarchical clustering is a sequence of partitions in which each partition is nested into the next (previous) partition in the sequence.

Partition Q is nested into partition P if every component of Q is a subset of a component of P.

$P = \{\{x_1, x_4, x_7, x_9, x_{10}\}, \{x_2, x_8\}, \{x_3, x_5, x_6\}\}$

$Q = \{\{x_1, x_4, x_9\}, \{x_7, x_{10}\}, \{x_2, x_8\}, \{x_5\}, \{x_3, x_6\}\}$
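The nesting condition can be checked mechanically. The sketch below is only an illustration: the function name `is_nested` is made up, and the index sets for P and Q are an interpretation of the slide's two example partitions, written as sets of the indices i of the objects x_i.

```python
def is_nested(q, p):
    """Return True if partition q is nested into partition p,
    i.e. every component of q is a subset of some component of p."""
    return all(any(block_q <= block_p for block_p in p) for block_q in q)

# Example partitions of x1..x10 (index sets as read from the slide).
P = [{1, 4, 7, 9, 10}, {2, 8}, {3, 5, 6}]
Q = [{1, 4, 9}, {7, 10}, {2, 8}, {5}, {3, 6}]

print(is_nested(Q, P))  # True: Q refines P
print(is_nested(P, Q))  # False: {1, 4, 7, 9, 10} fits in no component of Q
```

A hierarchical clustering is then a sequence of partitions in which this check holds for each consecutive pair.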

Hierarchical Clustering: Chameleon

Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling. Two clusters are merged if the interconnectivity and closeness between them are high relative to the internal interconnectivity and closeness of the objects within the clusters.

Density-based Methods

These methods typically regard clusters as dense regions of objects in the data space that are separated by regions of low density

DBSCAN: Based on Connected Regions with Sufficiently High Density (Nearest Neighbor Estimation)

DENCLUE: Based on Density Distribution Functions (Kernel Estimation)

Figure: DBSCAN results for DS2 with MinPts = 4 and Eps = (a) 5.0, (b) 3.5 and (c) 3.0
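To make the roles of Eps and MinPts concrete, here is a minimal DBSCAN sketch in plain Python. It is an illustration under stated assumptions rather than the implementation behind the figure: the sample points and the Eps value are made up, neighborhoods are found by brute force, and a point's neighborhood includes the point itself.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per point (cluster id or NOISE)."""
    def neighbors(i):
        # Eps-neighborhood of point i (includes i itself).
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may later be relabeled as a border point
            continue
        labels[i] = cluster            # i is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:   # j is also a core point: expand through it
                queue.extend(nbrs)
        cluster += 1
    return labels

# Hypothetical data: two dense squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 50)]
print(dbscan(points, eps=1.5, min_pts=4))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Shrinking Eps makes fewer points qualify as core points, so clusters split or dissolve into noise, which is the effect the three panels compare.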

Data and Knowledge Visualization

Tree map

Cone tree

Fisheye view

Hyperbolic tree

MagicLens

KDD Products and Tools

SPSS

IBM

Silicon Graphics

SAS

Salford Systems

RuleQuest Research (C4.5)

Challenges of KDD

Different types of data in different forms (mixed numeric, symbolic, text, image, voice, …)

Large data sets (10^6 to 10^12 bytes) and high dimensionality (10^2 to 10^3 attributes) [Problems: efficiency, scalability?]

[Problems: quality, effectiveness?]

Data and knowledge are changing

Human-Computer Interaction and Visualization

Large Datasets and High Dimensionality

3 attributes, each with 2 values: #instances = 2^3 = 8, #patterns = 27

What if the number of attributes increases?

The sizes of the instance space and the pattern space increase exponentially

With p attributes, each with d values, the size of the instance space is d^p

38 attributes, each with 10 values: #instances = 10^38

[Figure: diagram with labels H1, H2, H3, H4 and C1, C2, C3, C4]
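The counts on this slide are easy to reproduce. The sketch below makes one assumption that the slide does not spell out: a pattern is a conjunction in which each attribute is either fixed to one of its d values or left unconstrained, which gives (d + 1)^p patterns and matches the figure of 27 for p = 3, d = 2. The function names are illustrative.

```python
def instance_space_size(p, d):
    """Number of distinct instances: each of p attributes takes one of d values."""
    return d ** p

def pattern_space_size(p, d):
    """Number of conjunctive patterns, assuming each attribute is either
    fixed to one of its d values or left unconstrained ("don't care")."""
    return (d + 1) ** p

print(instance_space_size(3, 2))                # 8
print(pattern_space_size(3, 2))                 # 27
print(instance_space_size(38, 10) == 10 ** 38)  # True
```

Adding a single attribute multiplies both counts, so exhaustive enumeration stops being feasible after a few dozen attributes.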

Possible Solutions

Scalable and efficient algorithms (scalable: given an amount of main memory, the runtime increases linearly with the number of input instances)

Sampling (instance selection)

Dimensionality reduction (feature selection)

Approximation methods

Massively parallel processing

Integration of machine learning and database management

Numerical vs. Symbolic Data

Attribute types (table columns: Numerical, Symbolic):

Nominal (categorical): no structure; operations =, ≠

Ordinal: ordinal structure; operations =, ≠, ≥

Measurable: ring structure; operations =, ≠, ≥, +, ×

Examples on the slide: Places, Color; Rank, Resemblance; Age, Temperature, Taste; Income, Length

Combinatorial search in hypothesis spaces (machine learning)

Often matrix-based computation (multivariate data analysis)

Issues of Decision Tree Mining

Attribute selection

Pruning trees

From trees to rules (high cost of pruning)

Visualization

Data access: recent developments target very large training sets with fast, efficient and scalable algorithms (in-memory and secondary storage)

(well-known systems: C4.5 and CART)

Scalable Decision Tree Induction Methods

SLIQ (Mehta et al., 1996): builds an index for each attribute; only the class list and the current attribute list reside in memory

SPRINT (J. Shafer et al., 1996): constructs an attribute list data structure

PUBLIC (Rastogi & Shim, 1998): integrates tree splitting and tree pruning, so that tree growth stops earlier

RainForest (Gehrke, Ramakrishnan & Ganti, 1998): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)
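The (attribute, value, class label) aggregation that RainForest relies on can be illustrated with a small counter. This is a sketch of the idea only, not RainForest's data structure or API: the function name `avc_counts`, the records, and the attribute names are all hypothetical.

```python
from collections import Counter

def avc_counts(records, attribute, class_attribute):
    """Aggregate (attribute value, class label) counts for one attribute."""
    return Counter((r[attribute], r[class_attribute]) for r in records)

# Hypothetical training records reaching one tree node.
records = [
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "no"},
    {"outlook": "sunny", "play": "yes"},
    {"outlook": "rain", "play": "yes"},
    {"outlook": "rain", "play": "yes"},
]

counts = avc_counts(records, "outlook", "play")
for (value, label), n in sorted(counts.items()):
    print(f"outlook={value:<6} class={label:<4} count={n}")
```

Counts like these are enough to evaluate a split criterion such as information gain for the attribute, so the records themselves need not stay in memory. The table's size depends on the number of distinct values and classes, not on the number of records.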

Issues of Neural Network Mining

Neural networks effectively address the weakness of the symbolic AI approach in knowledge discovery (growth of the hypothesis space)

Extracting or making sense of the numeric weights associated with the interconnections of neurons, in order to arrive at a higher level of knowledge, has been and will continue to be a challenging problem

Issues of Association Rule Mining

Improving the efficiency

Database scan reduction: partitioning (Savasere 95), hashing (Park 95), sampling (Toivonen 96), dynamic itemset counting (Brin 97), finding non-redundant rules (3000 times fewer, Zaki KDD 2000)

Parallel mining of association rules

New measures of association

Interestingness and exceptional rules

Generalized and multiple-level rules

Mining Scientific Data

Data Mining in Bioinformatics

Data Mining in Astronomy and Earth Sciences

Mining Physics and Chemistry Data

Mining Large Image Databases

etc.

Some Advanced Techniques

Support Vector Machines

Independent Component Analysis

Level Sets and Data Mining

Multi-Relational Data Mining and Logic Programming

Ensemble Methods

Distributed and High Performance Computing

etc.

Possible Solutions

Scalable and efficient algorithms (scalable: given an amount of main memory, the runtime increases linearly with the number of input instances)

Massively parallel processing

Data-parallel vs. Control-parallel Data Mining

Client/Server Frameworks for Parallel Data Mining

Reference: Alex A. Freitas & Simon H. Lavington, Mining Very Large Databases with Parallel Processing, Kluwer Academic Publishers, 1998

Example of a Scalable Algorithm: Mixed Similarity Measure

Mixed Similarity Measures (MSM): Goodall (1966), time O(n^3); Diday and Gowda (1992); Ichino and Yaguchi (1994); Li & Biswas (1997), time O(n^2 log n^2), space O(n^2)

New and Efficient MSM (Binh & Bao, 2000): time and space O(n)

$\hat{P}^{*}_{ij} = 1 - \hat{P}_{ij}$ (figure labels: $P_{ij}$, $P^{*}_{ij}$)

Comparative Results

US Census database, 33 symbolic + 8 numeric attributes; Alpha 21264, 500 MHz, RAM 2 GB, Solaris OS (Nguyen N.B. & Ho T.B., PKDD 2000)

#cases            500       1,000     1,500     2,000     5,000     10,000    199,523   Complexity
                  (0.2M)    (0.5M)    (0.9M)    (1.1M)    (2.6M)    (5.2M)    (102M)
#values           497       992       1,486     1,973     4,858     9,651     97,799
Time of LiBis     67.3s     26m6.2s   1h46m31s  6h59m45s  >60h      not app   not app   O(n^2 log n^2)
Time of OURS      0.1s      0.2s      0.3s      0.5s      2.8s      9.2s      36m26s    O(n)
Memory of LiBis   5.3M      20.0M     44.0M     77.0M     455.0M    not app   not app   O(n^2)
Memory of OURS    0.5M      0.7M      0.9M      1.1M      2.1M      3.4M      64.0M     O(n)
Preprocessing     0.1s      0.1s      0.2s      0.5s      0.9s      6.2s      127.2s

Approaches of High Performance Computing to Data Mining

Data-oriented approaches: discretization; attribute selection; instance selection (sampling), by single sampling or iterative sampling

Algorithm-oriented approaches: fast algorithms (restricted search, algorithm optimization); distributed mining (voting, model integration, meta-learning, inter-processor cooperation); parallel mining (inter-algorithm parallelization)

Distributed & Parallel Data Mining

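[Figure, Distributed System: the data set to be mined is divided into Subset 1, ..., Subset P; an algorithm (Alg.) runs on each subset and produces knowledge (Know.); the results are combined into the final knowledge.]

[Figure, Parallel System: several algorithm instances (Alg.) work on the single data set to be mined; their knowledge (Know.) is combined into the final knowledge.]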
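The distributed scheme (mine each subset separately, then combine the knowledge) can be sketched with a toy learner and majority voting as the combination step. Everything here is an assumption made for illustration: the one-dimensional data, the nearest-class-mean learner, the round-robin split, and the function names.

```python
from collections import Counter, defaultdict

def learn_class_means(subset):
    """Toy learner: one mean value per class from (value, label) pairs."""
    sums, counts = defaultdict(float), Counter()
    for value, label in subset:
        sums[label] += value
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in counts}

def predict(model, value):
    """Predict the class whose mean is closest to the value."""
    return min(model, key=lambda label: abs(model[label] - value))

def mine_distributed(data, p):
    """Split the data into p subsets and learn one local model per subset."""
    subsets = [data[i::p] for i in range(p)]
    return [learn_class_means(s) for s in subsets]

def vote(models, value):
    """Combine the local models by majority voting."""
    ballots = Counter(predict(m, value) for m in models)
    return ballots.most_common(1)[0][0]

# Hypothetical labeled data: small values are "low", large values are "high".
data = [(1.0, "low"), (1.2, "low"), (0.8, "low"), (1.1, "low"), (0.9, "low"), (1.3, "low"),
        (5.0, "high"), (5.2, "high"), (4.8, "high"), (5.1, "high"), (4.9, "high"), (5.3, "high")]
models = mine_distributed(data, p=3)
print(vote(models, 1.0), vote(models, 4.7))  # low high
```

Voting only combines predictions. Model integration and meta-learning would instead merge or learn over the local models themselves.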

Parallel Data Mining

Rule induction, decision trees, neural networks, genetic algorithms, rough sets, association rules, clustering, etc.

1. Parallel Data Mining without DBMS Facilities

2. Parallel Data Mining with Database Facilities

[Figure: Exploiting data parallelism in instance-based learning. The new case is sent to processors 1, ..., p; each processor holds one subset of the stored cases and computes a local MIN (its local nearest case); a global MIN over the local nearest cases gives the nearest case.]
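The local MIN / global MIN scheme for instance-based learning can be sketched as follows. This is an illustration under assumptions: a thread pool stands in for the p processors (in CPython it shows the structure of the computation rather than real speed-up), the cases are hypothetical 2-D points, and Euclidean distance is assumed.

```python
from concurrent.futures import ThreadPoolExecutor
from math import dist

def local_nearest(subset, new_case):
    """Local MIN: nearest stored case within one processor's subset."""
    return min(subset, key=lambda case: dist(case, new_case))

def nearest_case(stored_cases, new_case, p=4):
    """Global MIN over the p local nearest cases."""
    subsets = [stored_cases[i::p] for i in range(p)]
    subsets = [s for s in subsets if s]  # skip empty subsets when p exceeds the number of cases
    with ThreadPoolExecutor(max_workers=p) as pool:
        local = list(pool.map(lambda s: local_nearest(s, new_case), subsets))
    return min(local, key=lambda case: dist(case, new_case))

# Hypothetical stored cases and a new case to classify.
stored = [(0.0, 0.0), (2.0, 2.0), (5.0, 5.0), (9.0, 1.0), (3.0, 4.0), (7.0, 7.0)]
print(nearest_case(stored, (3.2, 3.9), p=3))  # (3.0, 4.0)
```

Each worker scans only its own subset, and only p candidate cases reach the final reduction, so the communication cost stays small compared with the scan.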

Mining Stomach Cancer Data

Each year about 50,000 people in Japan die of stomach cancer. The aim is to use data mining methods to find new and useful knowledge.

The project started in summer 1999 and involves three data mining groups and doctors at the National Cancer Center in Tokyo.

The stomach cancer database was collected during 40 years (1962-1991). The transformed data table contains data on 6,712 patients described by 83 numeric and categorical attributes.

Overview of Our Data Mining Work

1. Understand the domain and define problems

- Use pre-operative data to predict the patient stage after the operation

- alive (3275), death after 5 years (575), death after 90 days (2552), death within 90 days (302), unknown (8).

2. Preprocess data

- Transform data: converting many-valued categorical attributes (280) into binary attributes

- Construct the target attribute

- Selection of 31 significant attributes by KJ and SFG methods

3. Data mining: extract patterns/models

- Learn decision trees by See5 and CABRO with tree visualization

- Learn prediction rules by CBA, Rosetta and our method LUPC

4. Interpret and evaluate discovered knowledge

- Meeting with medical experts every two months to evaluate the results

- Scores (1 – 5) are given to “Acceptability”, “Novelty” and “Utility” of discovered patterns

5. Put the results into practical use

- Data mining and evaluation are off-line

Learned Decision Trees with CABRO

Tightly-coupled views

T2.5D views (Trees 2.5 Dimensions)

Induced decision trees with graphical representation (easy to observe and interpret)

Learned Rules and Expert Evaluation

Some discovered rules (each scored by the experts from 1 to 5 on Acceptability, Novelty and Utility):

IF dcancer = S AND serosal = 3 AND peritoneal = 0 AND apnemia = 0 THEN death < 90 days

IF dcancer = x AND type = B3 AND peritoneal = 0 AND liver_metastasis = 3 THEN death < 90 days

IF sex = M AND age < 73 AND liver_metastasis = 3 AND cardio = 1 THEN death < 90 days

Most rules found are not new to medical experts

Very high false negative error in the (minority) target class

User-centered Data Mining

Active participation of the users (domain experts) in the KDD process and model selection

Putting the visualization power in the KDD process

Putting domain knowledge in mining

Visualization in the KDD Process

Synergistic visualization of data & knowledge in the knowledge discovery context

Appropriate interactive visualization techniques in the knowledge discovery process

Significant Hypothesis Detected by Visualization

Some instances in class “alive” have metastasis = 3

Putting Domain Knowledge in Mining

Exclusive constraints: If imposed, D2MS will find only rules that do not contain any of these constraints (attribute-value pairs) in their condition part.

Inclusive constraints: If imposed, D2MS will find only rules that each contain at least one of these constraints (attribute-value pairs) in their condition part.
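The two constraint types amount to simple set tests on a rule's condition part. The sketch below is not D2MS code, and D2MS may well apply the constraints during the search rather than afterwards. Here rules are hypothetical dictionaries and the constraints filter an already generated rule list, purely to show the semantics.

```python
def satisfies_exclusive(rule, constraints):
    """True if the condition part contains none of the constrained pairs."""
    return not (set(rule["conditions"].items()) & constraints)

def satisfies_inclusive(rule, constraints):
    """True if the condition part contains at least one constrained pair."""
    return bool(set(rule["conditions"].items()) & constraints)

# Hypothetical rules; attribute names are borrowed from the case study for flavor.
rules = [
    {"conditions": {"sex": "M", "liver_metastasis": 3}, "conclusion": "alive"},
    {"conditions": {"sex": "F", "middle_third": 1}, "conclusion": "death within 90 days"},
]
constraints = {("liver_metastasis", 3)}

without_pair = [r for r in rules if satisfies_exclusive(r, constraints)]
with_pair = [r for r in rules if satisfies_inclusive(r, constraints)]
print(len(without_pair), len(with_pair))  # 1 1
```

For a single constraint set the two filters are complements of each other: every rule passes exactly one of them.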

Putting Domain Knowledge in Mining

Finding irregular rules

Find only rules for class “death within 90 days” that do not contain the characteristic attribute “liver_metastasis” and/or its combination with two other typical attributes, “Peritoneal_metastasi” and “Serosal_invasion”, by exclusive constraints.

Rule 8 acc = 1.0 (4/4), cover = 0.001 (4/6712)

IF category = R AND sex = F AND proximal_third = 3 AND middle_third = 1

THEN death within 90 days

Finding rare rules

Find rules in the class “alive” that contain the symptom “liver_metastasis” by inclusive constraints.

Rule 1 acc = 0.500 (2/4); cover = 0.001 (4/6712)

IF sex = M AND type = B1 AND liver_metastasis = 3 AND middle_third = 1

THEN class = alive
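The two figures reported with each rule can be recomputed from the counts in parentheses. The reading assumed here, which is consistent with the numbers shown, is that accuracy is the fraction of matched cases that belong to the predicted class and coverage is the fraction of all 6,712 cases that the rule matches. The function name is illustrative.

```python
def rule_stats(n_correct, n_matched, n_total):
    """Return (accuracy, coverage) of a rule from raw counts."""
    accuracy = n_correct / n_matched   # matched cases that are in the predicted class
    coverage = n_matched / n_total     # share of the data set the rule matches
    return accuracy, coverage

acc, cover = rule_stats(4, 4, 6712)    # Rule 8
print(f"acc = {acc:.3f}, cover = {cover:.3f}")  # acc = 1.000, cover = 0.001

acc, cover = rule_stats(2, 4, 6712)    # Rule 1
print(f"acc = {acc:.3f}, cover = {cover:.3f}")  # acc = 0.500, cover = 0.001
```

Both rules match only 4 of the 6,712 patients, which is why they count as irregular or rare.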

Mining Hepatitis Data with Temporal Abstraction

The hepatitis relational database was collected during 1982-2001 at Chiba University Hospital

Our process of mining hepatitis data with temporal abstraction goes through six steps

Temporal Abstraction Problems & Data Analysis

Structure and problem of temporal abstraction

Structure of basic temporal abstraction: <episode, state & trend>

Example: <ALB 3 months, low & decreasing>

Problems: finding episodes, states, and trends.

Analysis of data by statistics and visualization tools

For example, when visualizing the relation between GOT, GPT, TTT, ZTT and fibrosis stages of one patient during 1985-1993, we observed that the values of GOT, GPT, TTT, and ZTT decrease when fibrosis becomes less severe.
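A basic temporal abstraction of the form <episode, state & trend> can be sketched as a function of the measurements in one episode. The rules used below are assumptions made only for illustration, not the method used in the study and not clinical reference values: the state compares the episode mean with a hypothetical normal range, and the trend compares the first and last values against a hypothetical tolerance.

```python
def abstract_episode(values, low, high, tolerance):
    """Return (state, trend) for the measurements of one episode.

    state: episode mean compared with the assumed normal range [low, high]
    trend: change from first to last measurement compared with a tolerance
    """
    mean = sum(values) / len(values)
    state = "low" if mean < low else "high" if mean > high else "normal"
    change = values[-1] - values[0]
    if change < -tolerance:
        trend = "decreasing"
    elif change > tolerance:
        trend = "increasing"
    else:
        trend = "stable"
    return state, trend

# Made-up ALB measurements over a 3-month episode, with made-up thresholds.
alb = [3.6, 3.4, 3.1]
state, trend = abstract_episode(alb, low=3.9, high=5.0, tolerance=0.2)
print(("ALB 3 months", state, trend))  # ('ALB 3 months', 'low', 'decreasing')
```

Choosing the episode length and the thresholds is the hard part in practice, and corresponds to the problems of finding episodes, states, and trends named on the slide.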

Abstracted Data and Primary Results

From the relational and temporal database, we derived abstracted descriptions and converted them into symbolic data in flat data tables.

Most rules for hepatitis B and C match from 2% to 5% of the database with high accuracy. The accuracy under 10-fold cross-validation is somewhat higher than 70%.

Using the D2MS system, we found different rule sets and decision trees for distinguishing hepatitis B and C, as well as the fibrosis stages.

The patient in the first row has abstractions on “ALB 3 months” as “normal & decreasing-decreasing” (N-DD), on “ALB 6 months” as “normal & decreasing-stable” (N-DS), etc.

[Figure panels: Original data, Abstracted data, Extracted rules]

Rules That Contradict Human Belief

Short term change: GOT (up), GPT (up), TTT (up), ZTT (up).

Long term change: T-CHO (down), CHE (down), ALB (down), TP (down), PLT (down), WBC (down), HGB (down), T-BIL (up), D-BIL (up), I-BIL (up), ICG-15 (up).

Many of the rules found contradict human belief

Rule 2: accuracy = 1.0 (12/12); coverage = 0.028 (12/426)

IF ALB2 = normal & decreasing-decreasing AND GOT4 = normal & decreasing-decreasing AND TTT4 = normal & decreasing-decreasing

THEN class = fibrosis stage F1

Rules Characterizing HBV and HCV

Example of a rule for hepatitis C

The rules show the difference in temporal patterns between HBV and HCV

Rules Characterizing Fibrosis Stages

Example of a rule characterizing fibrosis stage F4.

The rules show the difference in temporal patterns between fibrosis stages F0, F1, …, F4

Summary

KDD concepts, methods, challenges, examples

KDD is a new, fast growing interdisciplinary field for both research and application

Speeding up KDD algorithms is crucial

Recommended References

http://www.kdnuggets.com

David J. Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining, MIT Press, 2000

Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000

U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.