An Introduction to Knowledge Discovery and Data Mining
Tu Bao Ho, School of Knowledge Science, Japan Advanced Institute of Science and Technology
PDCAT 2002, T.B. Ho 2
Outline
Basic concepts of KDD
KDD techniques: classification, association, clustering, visualization
Challenges and trends in KDD
KDD and high performance computing
Case studies in medical data mining
Data, Information, Knowledge

Data: un-interpreted signals (e.g., days "1st 2nd 3rd 4th …" with readings "25 27 21 26 …")
Information: data equipped with meaning (the temperature of the days)
Knowledge: integrated information, including facts and their relations ("justified true belief"), e.g., E = mc²

Data mining metaphor: extracting ore from rock
Data, Information, Knowledge

Data: 29 (income, debt) points, in US$K:
(5.6, 8.5), (6.0, 13.0), (11.0, 12.0), (11.0, 19.0), (13.5, 10.0), (16.5, 20.0), (17.5, 15.0), (17.5, 5.0), (22.5, 25.0), (26.0, 7.5), (30.0, 9.0), (30.0, 18.0), (30.0, 30.0), (31.0, 14.0), (32.5, 25.0), (38.0, 12.0), (41.0, 9.0), (41.0, 22.0), (43.5, 12.5), (44.0, 27.5), (45.0, 22.5), (48.0, 28.0), (52.5, 21.0), (53.5, 32.0), (54.0, 27.5), (57.5, 18.0), (59.0, 18.0), (62.5, 32.5), (63.0, 18.0)

Information: Mean of Income = 34.5, Mean of Debt = 18.4

Knowledge: "if income < $33K, then the person has defaulted on the loan" (customers below the $33K income threshold have defaulted on the loan; those above have good status with the bank)
Knowledge Discovery and Data Mining (KDD)

KDD: the automatic extraction of non-obvious, hidden knowledge (patterns/models) from large volumes of data

10^6-10^12 bytes: we never see the whole data set or fit it in the memory of computers
What knowledge? How to represent and use it?
Which data mining algorithms?
From Data to Knowledge

Meningitis data, Tokyo Medical & Dental University, 38 attributes (numerical, categorical, with missing values and a class attribute):

10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0, 15, -, -, 6000, 2, 0, abnormal, abnormal, -, 2852, 2148, 712, 97, 49, F, -, multiple, , 2137, negative, n, n, ABSCESS, VIRUS
12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0, 15, -, -, 10700, 4, 0, normal, abnormal, +, 1080, 680, 400, 71, 59, F, -, ABPC+CZX, , 70, negative, n, n, n, BACTERIA, BACTERIA
15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0, 15, -, -, 6000, 0, 0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, -, FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA
16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0, abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ?, negative, ?, n, n, ABSCESS, VIRUS

IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15
THEN Prediction = VIRUS [confidence = 87.5%]
KDD: An Interdisciplinary Field

Databases: store, access, search, and update data (deduction)
Statistics: infer information from data (deduction and induction, mainly numeric data)
Machine learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)

KDD also draws on algorithmics, visualization, data warehouses, OLAP, etc.
KDD: A New and Fast-Growing Area

Conferences: KDD'95-'02 (ACM, America); PAKDD'97-'02 (Pacific Rim & Asia); PKDD'97-'02 (Europe); ICDM'01, '02 (IEEE); SDM'01, '02 (SIAM)
Industrial interest: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, …
Japan: the FGCS project focused on logic programming and reasoning; attention has been paid to knowledge acquisition and machine learning. 2001-2004: the "Active Mining Project"
Why KDD?

High-powered computers (larger disks, faster CPUs) and networked data have become widely available
People gather and store so much data because they believe valuable assets are implicitly coded within it; its true value depends on the ability to extract useful information
Manual data analysis is impractical at this scale
How to acquire knowledge for knowledge-based systems remains a difficult and crucial AI problem
Relational Databases

A relational database is a collection of tables. Each table is assigned a unique name and consists of a set of attributes and a set of tuples.

customer:
Cust-ID | name | address | age | income | credit-info | …
C1 | Smith, Sandy | 5463 E Hasting, Burnaby, BC V5A 459, Canada | 21 | $27000 | 1 | …

item:
Item-ID | name | brand | category | type | price | place-made | supplier | cost
I3 | high-res-TV | Toshiba | high resolution | TV | $988.00 | Japan | NikoX | $600.00
I8 | multidisc-CDplayer | Sanyo | multidisc | CD player | $369.00 | Japan | MusicFont | $120.00

employee:
Empl-ID | name | category | group | salary | commission
E55 | Jones, Jane | home entertainment | manager | $18,000 | 2%

branch:
Branch-ID | name | address
B1 | City Square | 369 Cambie St., Vancouver, BC V5L 3A2, Canada

purchases:
Trans-ID | cust-ID | empl-ID | date | time | method-paid | amount
T100 | C1 | E55 | 01/21/98 | 15:45 | Visa | $1357.00

item_sold:
Trans-ID | item-ID | qty
T100 | I3 | 1
T100 | I8 | 2

works_at:
Empl-ID | branch-ID
E55 | B1
Data Warehouses

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.

[Diagram: data sources in Chicago, New York, Vancouver, and Toronto are cleaned, transformed, integrated, and loaded into the data warehouse, which clients access through query and analysis tools]
Transactional Databases

A transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction.

Trans_ID | list of item_IDs
T100 | I1, I3, I8, I16
T200 | I3, I5, I23
… | …
Advanced Database Systems

Object-Oriented Databases
Object-Relational Databases
Spatial Databases
Temporal Databases and Time-Series Databases
Text Databases and Multimedia Databases
Heterogeneous Databases and Legacy Databases
The World Wide Web
Spatial Databases

Spatial databases contain spatially-related information: geographic databases, VLSI chip design databases, medical and satellite image databases, etc.
Data mining may uncover patterns describing the characteristics of houses located near a specified kind of location, the climate of mountainous areas located at various altitudes, etc.

[Figure: Japanese earthquakes, 1961-1994]
Temporal and Time-Series Databases

These store time-related data. A temporal database stores relational data that include time-related attributes (timestamps with different semantics). A time-series database stores sequences of values that change with time (e.g., stock exchange data).
Data mining finds the characteristics of object evolution and trends of change: e.g., stock exchange data can be mined to uncover trends in investment strategies.
Text and Multimedia Databases

Text databases contain documents, usually highly unstructured or semi-structured. Mining them can uncover general descriptions of object classes, keywords, content associations, the clustering behavior of text objects, etc.
Multimedia databases store image, audio, and video data: picture content-based retrieval, voice-mail systems, video-on-demand systems, speech-based user interfaces, etc.
The World Wide Web

The Web provides an enormous source of explicit and implicit knowledge that people can navigate and search for what they need.
Example: when examining the data collected from Internet Mart, heavily trodden paths gave BT hints about the regions of the site that were of key interest to its visitors.
The KDD Process

KDD is inherently interactive and iterative. Data mining is a step in the KDD process consisting of methods that produce useful patterns or models from the data.

1. Understand the domain and define problems
2. Collect and preprocess data (maybe 70-90% of the effort and cost in KDD)
3. Data mining: extract patterns/models
4. Interpret and evaluate the discovered knowledge
5. Put the results to practical use
The KDD Process

[Diagram: a detailed view of the five steps, starting from data organized by function (supported by query & report generation, aggregation & sequences, advanced methods, and data warehousing): (1) create/select the target database; (2) select a sampling technique and sample the data, supply missing values, eliminate noisy data, normalize values, transform values, create derived attributes, transform to a different representation, find important attributes & value ranges; (3) select DM task(s) and DM method(s); (4) extract knowledge, test knowledge; (5) refine knowledge]
Starting Points: Data or Mining?

Nature of data: flat data tables, relational databases, temporal & spatial data, transactions, multimedia data, text, Web

Mining tasks and methods:
Classification/prediction: decision trees, neural networks, rule induction, etc.
Description: association analysis, clustering, etc.
Outline

Basic concepts of KDD
KDD techniques: classification, association, clustering, visualization
Challenges and trends in KDD
KDD and high performance computing
Case studies in medical data mining
Primary Tasks of KDD

Predictive mining tasks perform inference on the current data in order to make predictions
Descriptive mining tasks characterize the general properties of the data in the database
Discovery of Patterns and/or Models

Models: a model is a global description of a data set, a high-level population or large-sample perspective. A model tells us about correlation between variables (regression), about hierarchies of clusters (clustering), a neural network, etc.

Patterns: a pattern is a low-level (local) summary of a relationship, which perhaps holds only for a few records or only a few variables. A pattern can be seen as a statement S in a language L that describes a subset D(S) of a database D with a quality q(S). Example:

IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15
THEN Prediction = VIRUS [87.5%]
Datasets: Cancerous and Healthy Cells

   | color | #nuclei | #tails | class
H1 | light | 1 | 1 | healthy
H2 | dark  | 1 | 1 | healthy
H3 | light | 1 | 2 | healthy
H4 | light | 2 | 1 | healthy
C1 | dark  | 1 | 2 | cancerous
C2 | dark  | 2 | 1 | cancerous
C3 | light | 2 | 2 | cancerous
C4 | dark  | 2 | 2 | cancerous
Classification/Prediction

Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown.

Typical model forms: decision trees, IF-THEN rules, neural networks, mathematical formulae, etc.
Classification: A Two-Step Process

1. Model construction: a classification algorithm builds a classifier (model) from the training data, e.g., "If color = dark and #tails = 2, then cancerous cell"
2. Model usage: the classifier predicts unknown cases, e.g., "Is this new cell cancerous?"
Comparing Classification Methods

Predictive accuracy: the ability of the classifier to correctly predict unseen data
Speed: the computational cost of building and using the classifier
Robustness: the ability of the classifier to make correct predictions given noisy data or data with missing values
Scalability: the ability to construct the classifier efficiently given large amounts of data
Interpretability: the level of understanding and insight provided by the classifier
Mining with Decision Trees

[Figure: a decision tree for the cell data. Root: #nuclei? If #nuclei = 1: test color? light → H; dark → test #tails? 1 → H, 2 → C. If #nuclei = 2: test color? light → test #tails? 1 → H, 2 → C; dark → C]
General Algorithm for Tree Induction

1. Choose the "best" attribute by a given measure for attribute selection
2. Extend the tree by adding a new branch for each value of the attribute
3. Sort the training examples to the leaf nodes
4. If the examples in a node belong to one class, then stop; else repeat steps 1-4 for the leaf nodes
5. Prune the tree to avoid over-fitting

Two phases: recursively grow the tree (steps 1-4), then prune it (step 5).
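A sketch of steps 1-4 (growing only, no pruning) on the cell data of slide 24. Information gain is used as the selection measure here as an illustrative assumption; the slide does not fix a particular measure:

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def grow_tree(rows, labels, attrs):
    """Steps 1-4: recursively grow an (unpruned) decision tree."""
    if len(set(labels)) == 1:                 # step 4: pure node -> leaf
        return labels[0]
    if not attrs:                             # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    def gain(a):                              # step 1: information gain of attribute a
        rem = sum(len(s) / len(labels) * entropy(s)
                  for v in set(r[a] for r in rows)
                  for s in [[l for r, l in zip(rows, labels) if r[a] == v]])
        return entropy(labels) - rem
    best = max(attrs, key=gain)
    rest = [a for a in attrs if a != best]
    return (best, {v: grow_tree([r for r in rows if r[best] == v],      # steps 2-3
                                [l for r, l in zip(rows, labels) if r[best] == v],
                                rest)
                   for v in set(r[best] for r in rows)})

def classify(tree, row):
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree

cells = [("light", 1, 1, "healthy"), ("dark", 1, 1, "healthy"),
         ("light", 1, 2, "healthy"), ("light", 2, 1, "healthy"),
         ("dark", 1, 2, "cancerous"), ("dark", 2, 1, "cancerous"),
         ("light", 2, 2, "cancerous"), ("dark", 2, 2, "cancerous")]
rows = [{"color": c, "nuclei": n, "tails": t} for c, n, t, _ in cells]
labels = [c[3] for c in cells]
tree = grow_tree(rows, labels, ["color", "nuclei", "tails"])
```

Since the eight training cells are consistent (no two identical cells with different labels), the grown tree classifies all of them correctly.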
Measures for Attribute Selection
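The body of this slide did not survive the transcript. As an assumption, the standard measures used by ID3/C4.5-style tree induction are entropy and information gain:

```latex
\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
\qquad
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) \;-\; \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
```

where p_i is the proportion of class i in S and S_v is the subset of S with value v for attribute A; the attribute with the largest gain is chosen in step 1.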
Other Classification Methods

Neural networks
Instance-based classification
Genetic algorithms
Rough set approach
Statistical approaches
Support vector machines
etc.
Mining with Neural Networks

[Figure: a neural network for the cell data, with inputs such as color = dark, #nuclei = 1, #tails = 2 and outputs Healthy / Cancerous]
Neural Networks

Advantages:
prediction accuracy is generally high
robust: works when training examples contain errors
output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
fast evaluation of the learned target function

Criticism:
long training time
the learned function (weights) is difficult to understand
domain knowledge is not easy to incorporate
Instance-based Classification

Instance-based classification uses the most similar individual instances known from the past to classify a new instance.

Typical approaches:
k-nearest neighbor: instances represented as points in a Euclidean space
Locally weighted regression: constructs a local approximation
Case-based reasoning: uses symbolic representations and knowledge-based inference
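A minimal k-nearest-neighbor sketch; the training points are hypothetical, loosely echoing the (income, debt) example of slide 4:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Majority vote among the k nearest training points.
    train is a list of (point, label) pairs; points are numeric tuples."""
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    nearest = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical (income, debt) points in US$K:
train = [((10, 9), "default"), ((17, 15), "default"), ((26, 8), "default"),
         ((41, 22), "good"), ((48, 28), "good"), ((59, 18), "good")]
print(knn_classify(train, (15, 12)))  # → default
```

There is no training phase: the stored instances themselves are the "model", which is why the method is also called lazy learning.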
Genetic Algorithms (GA)

GAs are based on an analogy to biological evolution:
Each rule is represented by a string of bits, e.g., "IF A1 and Not A2 then C2" can be encoded as 100
An initial population is created consisting of randomly generated rules
Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
The fitness of a rule is measured by its classification accuracy on a set of training examples
Offspring are generated by crossover and mutation
Rough Set Approach

Rough sets are used to approximately ("roughly") define equivalence classes. A rough set for a given class C is approximated by two sets:
a lower approximation (certain to be in C)
an upper approximation (possibly in C)

Uses: finding minimal subsets (reducts) of attributes, dependencies in data, rules, etc.

[Figure: a set X approximated from below and above by equivalence classes]

Rough Sets and Data Mining, T.Y. Lin, N. Cercone (eds.), Kluwer Academic Publishers, 1997
Rough Sets in Knowledge Discovery, L. Polkowski, A. Skowron (eds.), Physica-Verlag, 1998
Bayesian Classification

Calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of problems.

P(Ci|X) = probability that the instance X = <x1,…,xk> is of class Ci. Idea: assign to sample X the class label Ci such that P(Ci|X) is maximal.

Bayes' theorem:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Naïve assumption: attribute independence, so P(X|Ci) = P(x1|Ci) × … × P(xk|Ci)
A Bayesian belief network allows a subset of the variables to be conditionally independent.
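A minimal naïve Bayes sketch on the cell data of slide 24; Laplace smoothing is added as an assumption to avoid zero probabilities:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count-based estimates of P(Ci) and P(xj | Ci)."""
    prior = Counter(labels)
    cond = defaultdict(Counter)     # (class, attribute) -> value counts
    domain = defaultdict(set)       # attribute -> observed values
    for row, c in zip(rows, labels):
        for a, v in row.items():
            cond[(c, a)][v] += 1
            domain[a].add(v)
    return prior, cond, domain

def predict_nb(model, row):
    """argmax_i P(Ci) * prod_j P(xj | Ci), with Laplace smoothing."""
    prior, cond, domain = model
    n = sum(prior.values())
    def score(c):
        p = prior[c] / n
        for a, v in row.items():
            p *= (cond[(c, a)][v] + 1) / (prior[c] + len(domain[a]))
        return p
    return max(prior, key=score)

rows = [{"color": "light", "nuclei": 1, "tails": 1},
        {"color": "dark",  "nuclei": 1, "tails": 1},
        {"color": "light", "nuclei": 1, "tails": 2},
        {"color": "light", "nuclei": 2, "tails": 1},
        {"color": "dark",  "nuclei": 1, "tails": 2},
        {"color": "dark",  "nuclei": 2, "tails": 1},
        {"color": "light", "nuclei": 2, "tails": 2},
        {"color": "dark",  "nuclei": 2, "tails": 2}]
labels = ["healthy"] * 4 + ["cancerous"] * 4
model = train_nb(rows, labels)
print(predict_nb(model, {"color": "dark", "nuclei": 2, "tails": 2}))  # → cancerous
```

Despite the independence assumption rarely holding exactly, the argmax is often correct, which is what makes the naïve classifier practical.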
Market Basket Analysis

Analyzes customer buying habits by finding associations between the different items that customers place in their "shopping baskets"
Helps develop marketing strategies by gaining insight into which items are frequently purchased together by customers
Example: how often do people buy onigiri and beer together?
Mining with Association Rules

Association: the presence of the same color and #nuclei implies the presence of the same #tails in the same record.

If color = light and #nuclei = 1, then #tails = 1 (support = 12.5%; confidence = 50%)
If #nuclei = 2 and cell = cancerous, then #tails = 2 (support = 25%; confidence = 100%)

Support: the proportion of times that the rule applies. Confidence: the proportion of times that the rule is correct. (Apriori algorithm, R. Agrawal, 1993)
Rule Measures: Support and Confidence

Example: find all rules X & Y ⇒ Z with minimum confidence and support
support s = probability that a transaction contains {X, Y, Z}
confidence c = conditional probability that a transaction containing {X, Y} also contains Z

Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

With minimum support 50% and minimum confidence 50%:
A ⇒ C (s = 50%, c = 66.6%)
C ⇒ A (s = 50%, c = 100%)

[Venn diagram: customers buying onigiri, customers buying beer, and customers buying both]
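The two measures can be computed directly from the four-transaction table above; a minimal sketch:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """support(lhs ∪ rhs) / support(lhs)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

baskets = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support(baskets, {"A", "C"}))        # → 0.5
print(confidence(baskets, {"A"}, {"C"}))   # ≈ 0.667
print(confidence(baskets, {"C"}, {"A"}))   # → 1.0
```

This reproduces the slide's figures: A ⇒ C has s = 50%, c = 66.6%; C ⇒ A has s = 50%, c = 100%.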
Association Mining: Apriori Algorithm
It is composed of two steps:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a pre-determined minimum support count
2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence
Association Mining: the Apriori Principle

For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent.

Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

Frequent Itemset | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A, C} | 50%

(Min. support 50%, min. confidence 50%)
The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation

1. Find the frequent itemsets: the sets of items that have minimum support
   A subset of a frequent itemset must also be a frequent itemset (e.g., if {A, B} is frequent, both {A} and {B} must be frequent)
   Iteratively find frequent itemsets of cardinality 1 to k, alternating candidate sets and frequent sets: C1 → L1 → C2 → L2 → … → Lk
2. Use the frequent itemsets to generate association rules
Example (min_sup_count = 2)

Transactional data D:
TID | List of item IDs
T100 | I1, I2, I5
T200 | I2, I4
T300 | I2, I3
T400 | I1, I2, I4
T500 | I1, I3
T600 | I2, I3
T700 | I1, I3
T800 | I1, I2, I3, I5
T900 | I1, I2, I3

Scan D for the count of each candidate (C1), then compare with the minimum support count to get L1 (here C1 = L1, since every 1-itemset is frequent):
Itemset | Sup. count
{I1} | 6
{I2} | 7
{I3} | 6
{I4} | 2
{I5} | 2

Generate C2 candidates from L1 and scan D for their counts:
Itemset | Sup. count
{I1, I2} | 4
{I1, I3} | 4
{I1, I4} | 1
{I1, I5} | 2
{I2, I3} | 4
{I2, I4} | 2
{I2, I5} | 2
{I3, I4} | 0
{I3, I5} | 1
{I4, I5} | 0

Keeping those with count ≥ 2 gives L2: {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}

Generate C3 candidates from L2 and scan D:
Itemset | Sup. count
{I1, I2, I3} | 2
{I1, I2, I5} | 2

Both meet the minimum support count, so L3 = C3.
Mining with Clustering

Clustering analyzes data objects without consulting a known class label.
The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.
Partition-based clustering suits large sets of numerical data; hierarchical clustering, with at least O(n²) time complexity, seems not suitable for very large datasets.
What is Cluster Analysis?

A cluster is a collection of data objects such that:
objects in the cluster are similar to one another
objects in the cluster are dissimilar to the objects in other clusters
The process of grouping objects into clusters is called clustering.
Clustering in Different Fields

Statistics: for many years, a focus on distance-based clustering (S-Plus, SPSS, SAS)
Machine learning: unsupervised learning; in conceptual clustering, a group of objects forms a class only if it is described by a concept
KDD: efficient and effective clustering of large databases: scalability, complex shapes and types of data, high-dimensional clustering, mixed numerical and categorical data
What is Good Clustering?

A good clustering method produces high-quality clusters with:
high intra-class similarity (within a class)
low inter-class similarity (between classes)
The quality of clustering basically depends on the similarity measure and the cluster representative used by the method.
Typical Requirements of Clustering

Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noisy data
Insensitivity to the order of input records
High dimensionality
Constraint-based clustering
Interpretability and usability
Clustering Methods in KDD
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Methods
Partitioning Methods

Given n objects and k, the number of clusters to form, a partitioning algorithm organizes the objects into a partition of k clusters.
The clusters are formed to optimize an objective partitioning criterion so that the objects within a cluster are "similar", whereas the objects of different clusters are "dissimilar".
K-means Algorithm (K = 2)

1. Select two centers randomly from the n objects
2. Form two clusters by assigning each object to its nearest center
3. Calculate two new centers (cluster means)
4. Re-form the two clusters and repeat steps 2 and 3 until the stopping conditions hold
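The four steps can be sketched as Lloyd's iteration; the `init` parameter is an addition here, for deterministic illustration in place of the random selection of step 1:

```python
import random

def kmeans(points, k, iters=100, init=None):
    """Assign each point to its nearest center, recompute centers as
    cluster means, and stop when the centers no longer change."""
    centers = list(init) if init else random.sample(points, k)     # step 1
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                           # step 2
            j = min(range(k), key=lambda j: sum((a - b) ** 2
                                               for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        new = [tuple(sum(cs) / len(cl) for cs in zip(*cl)) if cl else centers[j]
               for j, cl in enumerate(clusters)]                   # step 3
        if new == centers:                                         # stopping condition
            break
        centers = new
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers, clusters = kmeans(pts, 2, init=[(0, 0), (10, 10)])
print(centers)  # → [(0.5, 0.5), (10.5, 10.5)]
```

With two well-separated blobs the iteration converges after one re-centering; as the next slide notes, the result is sensitive to outliers because the mean itself is.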
Partitioning Methods

The k-means algorithm is sensitive to outliers.
The k-medoids method uses medoids (the most centrally located object in a cluster): PAM (Partitioning Around Medoids); from k-medoids to CLARA (Clustering LARge Applications); from CLARA to CLARANS (Clustering LARge Applications based on RANdomized Search).
The EM (Expectation-Maximization) algorithm assigns an object to a cluster according to a weight representing its probability of membership.
Hierarchical Methods

A hierarchical clustering is a sequence of partitions in which each partition is nested into the next (previous) partition in the sequence.
Partition Q is nested into partition P if every component of Q is a subset of a component of P. For example:

P = {{x1, x4, x7, x9, x10}, {x2, x8}, {x3, x5, x6}}
Q = {{x1, x4, x9}, {x7, x10}, {x2, x8}, {x5}, {x3, x6}}
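The nesting test is a one-liner; using the two partitions above (with each x_i written as a plain integer):

```python
def is_nested(Q, P):
    """True if every component of Q is a subset of some component of P."""
    return all(any(q <= p for p in P) for q in Q)

P = [{1, 4, 7, 9, 10}, {2, 8}, {3, 5, 6}]
Q = [{1, 4, 9}, {7, 10}, {2, 8}, {5}, {3, 6}]
print(is_nested(Q, P))  # → True
print(is_nested(P, Q))  # → False
```

A hierarchical clustering is then a sequence of partitions in which each is nested into its successor.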
Hierarchical Clustering: Chameleon
Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling. Clusters are merged if the interconnectivity and closeness between them are highly related to the internal interconnectivity and closeness of objects within the clusters.
Density-based Methods

These typically regard clusters as dense regions of objects in the data space, separated by regions of low density.
DBSCAN: based on connected regions with sufficiently high density (nearest-neighbor estimation)
DENCLUE: based on density distribution functions (kernel estimation)

[Figure: DBSCAN results for dataset DS2 with MinPts = 4 and Eps = (a) 5.0, (b) 3.5, and (c) 3.0]
Data and Knowledge Visualization

Techniques: tree maps, cone trees, fisheye views, hyperbolic trees, Magic Lens

[Figure: example visualizations, with labels such as "Sunday 11-12 PM" and "Lunch time"]
KDD Products and Tools

SPSS, IBM, Silicon Graphics, SAS, Salford Systems, RuleQuest Research (C4.5), …
Outline

Basic concepts of KDD
KDD techniques: classification, association, clustering, visualization
Challenges and trends in KDD
KDD and high performance computing
Case studies in medical data mining
Challenges of KDD

Different types of data in different forms (mixed numeric, symbolic, text, image, voice, …) [problems: quality, effectiveness?]
Large data sets (10^6-10^12 bytes) and high dimensionality (10^2-10^3 attributes) [problems: efficiency, scalability?]
Data and knowledge are changing
Human-computer interaction and visualization
Large Datasets and High Dimensionality

3 attributes, each with 2 values: #instances = 2^3 = 8, #patterns = 27 (= (2+1)^3, allowing a "don't care" per attribute)
What if the number of attributes increases? The sizes of the instance space and pattern space grow exponentially:
with p attributes, each having d values, the size of the instance space is d^p
38 attributes, each with 10 values: #instances = 10^38
Possible Solutions

Scalable and efficient algorithms (scalable: given an amount of main memory, the runtime increases linearly with the number of input instances)
Sampling (instance selection)
Dimensionality reduction (feature selection)
Approximation methods
Massively parallel processing
Integration of machine learning and database management
Numerical vs. Symbolic Data

Attribute type | Structure | Operations | Examples
Nominal (categorical) | no structure | =, ≠ | places, color
Ordinal | ordinal structure | =, ≠, ≥ | rank, resemblance
Measurable (numerical) | ring structure | =, ≠, ≥, +, × | age, temperature, taste, income, length

Symbolic data: combinatorial search in hypothesis spaces (machine learning)
Numerical data: often matrix-based computation (multivariate data analysis)
Issues of Decision Tree Mining

Attribute selection
Pruning trees
From trees to rules (high cost of pruning)
Visualization
Data access: recent development on very large training sets; fast, efficient, and scalable algorithms (in-memory and secondary storage)

(Well-known systems: C4.5 and CART)
Scalable Decision Tree Induction Methods

SLIQ (Mehta et al., 1996): builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (J. Shafer et al., 1996): constructs an attribute-list data structure
PUBLIC (Rastogi & Shim, 1998): integrates tree splitting and tree pruning to stop growing the tree earlier
RainForest (Gehrke, Ramakrishnan & Ganti, 1998): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)
Issues of Neural Network Mining

Neural networks effectively address the weakness of the symbolic AI approach to knowledge discovery (growth of the hypothesis space)
Extracting or making sense of the numeric weights associated with the interconnections of neurons, to come up with a higher level of knowledge, has been and will continue to be a challenging problem
Issues of Association Rule Mining

Improving efficiency: database scan reduction by partitioning (Savasere 95), hashing (Park 95), sampling (Toivonen 96), dynamic itemset counting (Brin 97), finding non-redundant rules (3000 times fewer, Zaki, KDD 2000)
Parallel mining of association rules
New measures of association: interestingness and exceptional rules
Generalized and multiple-level rules
Mining Scientific Data

Data mining in bioinformatics
Data mining in astronomy and the earth sciences
Mining physics and chemistry data
Mining large image databases
etc.
Some Advanced Techniques
Support Vector Machines
Independent Component Analysis
Level Sets and Data Mining
Multi-Relational Data Mining and Logic Programming
Ensemble Methods
Distributed and High Performance Computing
etc.
Outline

Basic concepts of KDD
KDD techniques: classification, association, clustering, visualization
Challenges and trends in KDD
KDD and high performance computing
Case studies in medical data mining
Possible Solutions

Scalable and efficient algorithms (scalable: given an amount of main memory, the runtime increases linearly with the number of input instances)
Massively parallel processing: data-parallel vs. control-parallel data mining
Client/server frameworks for parallel data mining

Mining Very Large Databases with Parallel Processing, Alex A. Freitas & Simon H. Lavington, Kluwer Academic Publishers, 1998
Example of a Scalable Algorithm: Mixed Similarity Measure

Mixed similarity measures (MSM): Goodall (1966), time O(n^3); Diday and Gowda (1992); Ichino and Yaguchi (1994); Li & Biswas (1997), time O(n^2 log n^2), space O(n^2)
New and efficient MSM (Binh & Bao, 2000): time and space O(n), with the similarity computed as P̂_ij = 1 − P̂*_ij
Comparative Results

US Census database, 33 symbolic + 8 numeric attributes; Alpha 21264, 500 MHz, 2 GB RAM, Solaris OS (Nguyen N.B. & Ho T.B., PKDD 2000)

#cases (size) | 500 (0.2M) | 1,000 (0.5M) | 1,500 (0.9M) | 2,000 (1.1M) | 5,000 (2.6M) | 10,000 (5.2M) | 199,523 (102M)
#values | 497 | 992 | 1,486 | 1,973 | 4,858 | 9,651 | 97,799
Time, LiBis O(n^2 log n^2) | 67.3s | 26m6.2s | 1h46m31s | 6h59m45s | >60h | n/a | n/a
Time, ours O(n) | 0.1s | 0.2s | 0.3s | 0.5s | 2.8s | 9.2s | 36m26s
Memory, LiBis O(n^2) | 5.3M | 20.0M | 44.0M | 77.0M | 455.0M | n/a | n/a
Memory, ours O(n) | 0.5M | 0.7M | 0.9M | 1.1M | 2.1M | 3.4M | 64.0M
Preprocessing | 0.1s | 0.1s | 0.2s | 0.5s | 0.9s | 6.2s | 127.2s
Approaches of High Performance Computing to Data Mining

Data-oriented approaches: discretization; attribute selection; instance selection (sampling: single sampling, iterative sampling)
Algorithm-oriented approaches: fast algorithms (restricted search, algorithm optimization); distributed mining (voting, model integration, meta-learning); parallel mining (inter-processor cooperation, inter-algorithm parallelization)
Distributed & Parallel Data Mining

Distributed system: the data set to be mined is split into subsets 1…P; an algorithm runs on each subset to produce local knowledge, and the pieces of knowledge are combined.
Parallel system: several algorithms run in parallel on the data set to be mined, each producing knowledge that is then combined.
Parallel Data Mining

Methods: rule induction, decision trees, neural networks, genetic algorithms, rough sets, association rules, clustering, etc.
1. Parallel data mining without DBMS facilities
2. Parallel data mining with database facilities

Exploiting data parallelism in instance-based learning: the stored cases are split into subsets 1…p, one per processor; each processor finds its local nearest case to the new case (local MIN), and a global MIN over the local results yields the nearest case.
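The data-parallel scheme above can be sketched with a thread pool standing in for the p processors (the subset split and the local/global MIN structure follow the slide; the thread pool itself is an illustrative stand-in):

```python
from concurrent.futures import ThreadPoolExecutor
import math

def local_min(subset, query):
    """Each worker returns its nearest stored case (local MIN)."""
    return min(subset, key=lambda c: math.dist(c, query))

def parallel_nearest(stored_cases, query, p=4):
    """Split the stored cases into p subsets, compute local MINs in
    parallel, then take the global MIN over the local results.
    Assumes p <= len(stored_cases) so every subset is non-empty."""
    chunks = [stored_cases[i::p] for i in range(p)]
    with ThreadPoolExecutor(max_workers=p) as ex:
        local_results = list(ex.map(lambda s: local_min(s, query), chunks))
    return min(local_results, key=lambda c: math.dist(c, query))

stored = [(0, 0), (5, 5), (2, 2), (9, 9), (1, 1), (7, 7), (3, 3), (8, 8)]
print(parallel_nearest(stored, (2.4, 2.4), p=4))  # → (2, 2)
```

The distance scan dominates the cost, so splitting the cases across p workers divides the per-worker work by p while the final global MIN compares only p candidates.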
Outline

Basic concepts of KDD
KDD techniques: classification, association, clustering, visualization
Challenges and trends in KDD
KDD and high performance computing
Case studies in medical data mining
Mining Stomach Cancer Data

Each year about 50,000 people die of stomach cancer in Japan. Data mining methods are expected to find new and useful knowledge.
The project started in summer 1999 and includes three data mining groups and doctors at the National Cancer Center in Tokyo.
The stomach cancer database was collected over 40 years (1962-1991). The transformed data table contains data on 6,712 patients described by 83 numeric and categorical attributes.
Overview of Our Data Mining Work

1. Understand the domain and define problems: use pre-operative data to predict the patient's stage after the operation; outcome classes: alive (3275), death after 5 years (575), death after 90 days (2552), death within 90 days (302), unknown (8)
2. Preprocess data: transform the data by converting categorical many-valued attributes (280) into binary attributes; construct the target attribute; select 31 significant attributes by the KJ and SFG methods
3. Data mining, extract patterns/models: learn decision trees by See5 and CABRO with tree visualization; learn prediction rules by CBA, Rosetta, and our method LUPC
4. Interpret and evaluate the discovered knowledge: meet with medical experts every two months to evaluate the results; scores (1-5) are given to the "acceptability", "novelty", and "utility" of discovered patterns
5. Put the results to practical use: data mining and evaluation are off-line
Learned Decision Trees with CABRO

Tightly-coupled views; T2.5D views (Trees 2.5 Dimensions); induced decision trees with graphical representation (easy to observe and interpret)
Learned Rules and Expert Evaluation

Some discovered rules (each scored 1-5 for acceptability, novelty, and utility):

IF dcancer = S AND serosal = 3 AND peritoneal = 0 AND apnemia = 0 THEN death < 90 days
IF dcancer = x AND type = B3 AND peritoneal = 0 AND liver_metastasis = 3 THEN death < 90 days
IF sex = M AND age < 73 AND liver_metastasis = 3 AND cardio = 1 THEN death < 90 days

Most rules found are not new to medical experts. There is a very high false-negative error in the (minority) target class.
User-centered Data Mining

Active participation of the user (domain experts) in the KDD process and model selection
Putting visualization power into the KDD process
Putting domain knowledge into mining
PDCAT 2002, T.B. Ho 84
Visualization in the KDD Process
Synergistic visualization of data and knowledge within the knowledge discovery context
Appropriate interactive visualization techniques in the knowledge discovery process
PDCAT 2002, T.B. Ho 85
Significant Hypothesis Detected by Visualization
Some instances in the class "alive" have metastasis = 3
PDCAT 2002, T.B. Ho 86
Putting Domain Knowledge in Mining
Exclusive constraints: if imposed, D2MS finds only rules whose condition parts contain none of the specified constraints (attribute-value pairs).
Inclusive constraints: if imposed, D2MS finds only rules whose condition parts contain at least one of the specified constraints (attribute-value pairs).
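The two constraint types can be read as a simple filter over candidate rules. A minimal sketch (not D2MS's actual implementation — rules are modeled as sets of attribute-value pairs for illustration):

```python
def satisfies(rule_conditions, exclusive=(), inclusive=()):
    """Keep a rule only if its condition part contains none of the
    exclusive attribute-value pairs and, when inclusive constraints
    are given, at least one of them."""
    conds = set(rule_conditions)
    if conds & set(exclusive):          # exclusive: forbid any overlap
        return False
    if inclusive and not (conds & set(inclusive)):  # inclusive: require some overlap
        return False
    return True

rule = [("liver_metastasis", "3"), ("sex", "M")]
assert not satisfies(rule, exclusive=[("liver_metastasis", "3")])
assert satisfies(rule, inclusive=[("liver_metastasis", "3")])
```

Exclusive constraints thus steer the search toward "irregular" rules that avoid the dominant attributes, while inclusive constraints force rules to mention a rare symptom, as on the next slide.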
PDCAT 2002, T.B. Ho 87
Putting Domain Knowledge in Mining
Finding irregular rules
Using exclusive constraints, find only rules for the class "death within 90 days" that do not contain the characteristic attribute "liver_metastasis" and/or its combination with two other typical attributes, "Peritoneal_metastasis" and "Serosal_invasion".

Rule 8: acc = 1.0 (4/4), cover = 0.001 (4/6712)
IF category = R AND sex = F AND proximal_third = 3 AND middle_third = 1
THEN death within 90 days

Finding rare rules
Using inclusive constraints, find rules in the class "alive" that contain the symptom "liver_metastasis".

Rule 1: acc = 0.500 (2/4); cover = 0.001 (4/6712)
IF sex = M AND type = B1 AND liver_metastasis = 3 AND middle_third = 1
THEN class = alive
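The acc and cover figures attached to each rule follow the usual definitions: accuracy is the fraction of covered examples that belong to the target class, coverage the fraction of the whole database the rule covers. A small sketch (function name is ours, not from D2MS):

```python
def rule_metrics(covered, correct_in_covered, total):
    """acc  = fraction of covered examples in the target class
       cover = fraction of the whole database the rule covers"""
    acc = correct_in_covered / covered
    cover = covered / total
    return acc, cover

# Rule 1 above: 4 covered patients, 2 of them in class "alive",
# out of 6,712 patients in the database
acc, cover = rule_metrics(4, 2, 6712)
# acc = 0.5; cover rounds to 0.001 as reported on the slide
```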
PDCAT 2002, T.B. Ho 88
Mining Hepatitis Data with Temporal Abstraction
The hepatitis relational database was collected during 1982-2001 at the Chiba University Hospital.
Our process of mining hepatitis data with temporal abstraction goes through six steps
PDCAT 2002, T.B. Ho 89
Temporal Abstraction Problems & Data Analysis
Structure and problems of temporal abstraction
A basic temporal abstraction has the structure <episode, state & trend>, for example <ALB 3 months, low & decreasing>. The problems are finding episodes, states, and trends.
For example, when visualizing the relation between GOT, GPT, TTT, ZTT and fibrosis stages of one patient during 1985-1993, we observed that the values of GOT, GPT, TTT, and ZTT decrease when fibrosis becomes less severe.
Analysis of data by statistics and visualization tools
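The <episode, state & trend> abstraction can be sketched in a few lines: take the values of one lab test within an episode, derive the state from a normal range and the trend from the overall change. This is a deliberately simplified illustration, not the project's actual abstraction method, and the ALB normal range used is an assumption:

```python
def abstract_episode(values, low, high):
    """Abstract one episode of a lab-test series into (state, trend):
    state from the normal range [low, high], trend from the sign of
    the overall change across the episode."""
    mean = sum(values) / len(values)
    state = "low" if mean < low else "high" if mean > high else "normal"
    delta = values[-1] - values[0]
    trend = ("decreasing" if delta < 0
             else "increasing" if delta > 0
             else "stable")
    return state, trend

# ALB over a 3-month episode, assuming a normal range of 3.9-5.0 g/dL
abstract_episode([3.5, 3.2, 3.0], low=3.9, high=5.0)
# -> ("low", "decreasing"), i.e. the slide's <ALB 3 months, low & decreasing>
```

Running this over every test and episode length turns the temporal database into the symbolic flat table described on the next slide.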
PDCAT 2002, T.B. Ho 90
Abstracted Data and Primary Results
From the relational and temporal database, we derived abstracted descriptions and converted them into symbolic data in flat data tables.
Using the system D2MS, we found different rule sets and decision trees for distinguishing hepatitis B and C, as well as the fibrosis stages.
Most rules for hepatitis B and C match from 2% to 5% of the database with high accuracy. The accuracy under 10-fold cross-validation is somewhat higher than 70%.
The patient in the first row has abstractions on "ALB 3 months" as "normal & decreasing-decreasing" (N-DD), on "ALB 6 months" as "normal & decreasing-stable" (N-DS), etc.
(Figure: original data, abstracted data, and extracted rules)
PDCAT 2002, T.B. Ho 91
Rules Contradicting Human Belief
Short-term change: GOT (up), GPT (up), TTT (up), ZTT (up).
Long-term change: T-CHO (down), CHE (down), ALB (down), TP (down), PLT (down), WBC (down), HGB (down), T-BIL (up), D-BIL (up), I-BIL (up), ICG-15 (up).
Many rules found contradict human belief, for example:

Rule 2: accuracy = 1.0 (12/12); coverage = 0.028 (12/426)
IF ALB2 = normal & decreasing-decreasing
AND GOT4 = normal & decreasing-decreasing
AND TTT4 = normal & decreasing-decreasing
THEN class = fibrosis stage F1
PDCAT 2002, T.B. Ho 92
Rules Characterizing HBV and HCV
Example of a rule for hepatitis C
The rules show the difference in temporal patterns between HBV and HCV.
PDCAT 2002, T.B. Ho 93
Rules Characterizing Fibrosis Stages
Example of a rule characterizing fibrosis stage F4.
The rules show the difference in temporal patterns between the fibrosis stages F0, F1, …, F4
PDCAT 2002, T.B. Ho 94
Summary
KDD concepts, methods, challenges, examples
KDD is a new, fast-growing interdisciplinary field for both research and application
Speeding up KDD algorithms is crucial
PDCAT 2002, T.B. Ho 95
Recommended References
http://www.kdnuggets.com
David J. Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining, MIT Press, 2000
Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000
U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.