An Introduction to Knowledge Discovery and Data Mining
Tu Bao Ho, School of Knowledge Science, Japan Advanced Institute of Science and Technology
PDCAT 2002, T.B. Ho 2
Outline
Basic concepts of KDD
KDD techniques: classification, association, clustering, visualization
Challenges and trends in KDD
KDD and high performance computing
Case studies in medical data mining
Data, Information, Knowledge

Data: un-interpreted signals (e.g., days "1st 2nd 3rd 4th …" with readings "25 27 21 26 …")
Information: data equipped with meaning (the temperature of the days)
Knowledge: integrated information, including facts and their relations ("justified true belief"), e.g., E = mc²

Data mining metaphor: extracting ore from rock
Data, Information, Knowledge

Data: 29 (income, debt) points, in US$K:
(5.6, 8.5), (6.0, 13.0), (11.0, 12.0), (11.0, 19.0), (13.5, 10.0), (16.5, 20.0), (17.5, 15.0), (17.5, 5.0), (22.5, 25.0), (26.0, 7.5), (30.0, 9.0), (30.0, 18.0), (30.0, 30.0), (31.0, 14.0), (32.5, 25.0), (38.0, 12.0), (41.0, 9.0), (41.0, 22.0), (43.5, 12.5), (44.0, 27.5), (45.0, 22.5), (48.0, 28.0), (52.5, 21.0), (53.5, 32.0), (54.0, 27.5), (57.5, 18.0), (59.0, 18.0), (62.5, 32.5), (63.0, 18.0)

Information: Mean of Income = 34.5, Mean of Debt = 18.4

Knowledge: "if income < $33K, then the person has defaulted on the loan" (customers below the $33K income threshold have defaulted on the loan; those above have good status with the bank)
Knowledge Discovery and Data Mining (KDD)

KDD: the automatic extraction of non-obvious, hidden knowledge (patterns/models) from large volumes of data

10^6-10^12 bytes: we never see the whole data set or fit it in the memory of computers
What knowledge? How to represent and use it?
Which data mining algorithms?
From Data to Knowledge

Meningitis data, Tokyo Medical & Dental University, 38 attributes (numerical, categorical, with missing values and a class attribute):

10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0, 15, -, -, 6000, 2, 0, abnormal, abnormal, -, 2852, 2148, 712, 97, 49, F, -, multiple, , 2137, negative, n, n, ABSCESS, VIRUS
12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0, 15, -, -, 10700, 4, 0, normal, abnormal, +, 1080, 680, 400, 71, 59, F, -, ABPC+CZX, , 70, negative, n, n, n, BACTERIA, BACTERIA
15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0, 15, -, -, 6000, 0, 0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, -, FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA
16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0, abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ?, negative, ?, n, n, ABSCESS, VIRUS

IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15
THEN Prediction = VIRUS [confidence = 87.5%]
KDD: An Interdisciplinary Field

Databases: store, access, search, and update data (deduction)
Statistics: infer information from data (deduction and induction, mainly numeric data)
Machine learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)

KDD also draws on algorithmics, visualization, data warehouses, OLAP, etc.
KDD: A New and Fast-Growing Area

Conferences: KDD'95-'02 (ACM, America); PAKDD'97-'02 (Pacific Rim & Asia); PKDD'97-'02 (Europe); ICDM'01, '02 (IEEE); SDM'01, '02 (SIAM)
Industrial interest: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, …
Japan: the FGCS project focused on logic programming and reasoning; attention has been paid to knowledge acquisition and machine learning. 2001-2004: the "Active Mining Project"
Why KDD?

High-powered computers (larger disks, faster CPUs) and networked data have become widely available
People gather and store so much data because they believe valuable assets are implicitly coded within it; its true value depends on the ability to extract useful information
Manual data analysis is impractical at this scale
How to acquire knowledge for knowledge-based systems remains a difficult and crucial AI problem
Relational Databases

A relational database is a collection of tables. Each table is assigned a unique name and consists of a set of attributes and a set of tuples.

customer:
Cust-ID | name | address | age | income | credit-info | …
C1 | Smith, Sandy | 5463 E Hasting, Burnaby, BC V5A 459, Canada | 21 | $27000 | 1 | …

item:
Item-ID | name | brand | category | type | price | place-made | supplier | cost
I3 | high-res-TV | Toshiba | high resolution | TV | $988.00 | Japan | NikoX | $600.00
I8 | multidisc-CDplayer | Sanyo | multidisc | CD player | $369.00 | Japan | MusicFont | $120.00

employee:
Empl-ID | name | category | group | salary | commission
E55 | Jones, Jane | home entertainment | manager | $18,000 | 2%

branch:
Branch-ID | name | address
B1 | City Square | 369 Cambie St., Vancouver, BC V5L 3A2, Canada

purchases:
Trans-ID | cust-ID | empl-ID | date | time | method-paid | amount
T100 | C1 | E55 | 01/21/98 | 15:45 | Visa | $1357.00

item_sold:
Trans-ID | item-ID | qty
T100 | I3 | 1
T100 | I8 | 2

works_at:
Empl-ID | branch-ID
E55 | B1
Data Warehouses

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.

[Diagram: data sources in Chicago, New York, Vancouver, and Toronto are cleaned, transformed, integrated, and loaded into the data warehouse, which clients access through query and analysis tools]
Transactional Databases

A transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction.

Trans_ID | list of item_IDs
T100 | I1, I3, I8, I16
T200 | I3, I5, I23
… | …
Advanced Database Systems

Object-Oriented Databases
Object-Relational Databases
Spatial Databases
Temporal Databases and Time-Series Databases
Text Databases and Multimedia Databases
Heterogeneous Databases and Legacy Databases
The World Wide Web
Spatial Databases

Spatial databases contain spatially-related information: geographic databases, VLSI chip design databases, medical and satellite image databases, etc.
Data mining may uncover patterns describing the characteristics of houses located near a specified kind of location, the climate of mountainous areas located at various altitudes, etc.

[Figure: Japanese earthquakes, 1961-1994]
Temporal and Time-Series Databases

These store time-related data. A temporal database stores relational data that include time-related attributes (timestamps with different semantics). A time-series database stores sequences of values that change with time (e.g., stock exchange data).
Data mining finds the characteristics of object evolution and trends of change: e.g., stock exchange data can be mined to uncover trends in investment strategies.
Text and Multimedia Databases

Text databases contain documents, usually highly unstructured or semi-structured. Mining them can uncover general descriptions of object classes, keywords, content associations, the clustering behavior of text objects, etc.
Multimedia databases store image, audio, and video data: picture content-based retrieval, voice-mail systems, video-on-demand systems, speech-based user interfaces, etc.
The World Wide Web

The Web provides an enormous source of explicit and implicit knowledge that people can navigate and search for what they need.
Example: when examining the data collected from Internet Mart, heavily trodden paths gave BT hints about the regions of the site that were of key interest to its visitors.
The KDD Process

KDD is inherently interactive and iterative. Data mining is a step in the KDD process consisting of methods that produce useful patterns or models from the data.

1. Understand the domain and define problems
2. Collect and preprocess data (maybe 70-90% of the effort and cost in KDD)
3. Data mining: extract patterns/models
4. Interpret and evaluate the discovered knowledge
5. Put the results to practical use
The KDD Process

[Diagram: a detailed view of the five steps, starting from data organized by function (supported by query & report generation, aggregation & sequences, advanced methods, and data warehousing): (1) create/select the target database; (2) select a sampling technique and sample the data, supply missing values, eliminate noisy data, normalize values, transform values, create derived attributes, transform to a different representation, find important attributes & value ranges; (3) select DM task(s) and DM method(s); (4) extract knowledge, test knowledge; (5) refine knowledge]
Starting Points: Data or Mining?

Nature of data: flat data tables, relational databases, temporal & spatial data, transactions, multimedia data, text, Web

Mining tasks and methods:
Classification/prediction: decision trees, neural networks, rule induction, etc.
Description: association analysis, clustering, etc.
Outline

Basic concepts of KDD
KDD techniques: classification, association, clustering, visualization
Challenges and trends in KDD
KDD and high performance computing
Case studies in medical data mining
Primary Tasks of KDD

Predictive mining tasks perform inference on the current data in order to make predictions
Descriptive mining tasks characterize the general properties of the data in the database
Discovery of Patterns and/or Models

Models: a model is a global description of a data set, a high-level population or large-sample perspective. A model tells us about correlation between variables (regression), about hierarchies of clusters (clustering), a neural network, etc.

Patterns: a pattern is a low-level (local) summary of a relationship, which perhaps holds only for a few records or only a few variables. A pattern can be seen as a statement S in a language L that describes a subset D(S) of a database D with a quality q(S). Example:

IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15
THEN Prediction = VIRUS [87.5%]
Datasets: Cancerous and Healthy Cells

   | color | #nuclei | #tails | class
H1 | light | 1 | 1 | healthy
H2 | dark  | 1 | 1 | healthy
H3 | light | 1 | 2 | healthy
H4 | light | 2 | 1 | healthy
C1 | dark  | 1 | 2 | cancerous
C2 | dark  | 2 | 1 | cancerous
C3 | light | 2 | 2 | cancerous
C4 | dark  | 2 | 2 | cancerous
Classification/Prediction

Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown.

Typical model forms: decision trees, IF-THEN rules, neural networks, mathematical formulae, etc.
Classification: A Two-Step Process

1. Model construction: a classification algorithm builds a classifier (model) from the training data, e.g., "If color = dark and #tails = 2, then cancerous cell"
2. Model usage: the classifier predicts unknown cases, e.g., "Is this new cell cancerous?"
Comparing Classification Methods

Predictive accuracy: the ability of the classifier to correctly predict unseen data
Speed: the computational cost of building and using the classifier
Robustness: the ability of the classifier to make correct predictions given noisy data or data with missing values
Scalability: the ability to construct the classifier efficiently given large amounts of data
Interpretability: the level of understanding and insight provided by the classifier
Mining with Decision Trees

[Figure: a decision tree for the cell data. Root: #nuclei? If #nuclei = 1: test color? light → H; dark → test #tails? 1 → H, 2 → C. If #nuclei = 2: test color? light → test #tails? 1 → H, 2 → C; dark → C]
General Algorithm for Tree Induction

1. Choose the "best" attribute by a given measure for attribute selection
2. Extend the tree by adding a new branch for each value of the attribute
3. Sort the training examples to the leaf nodes
4. If the examples in a node belong to one class, then stop; else repeat steps 1-4 for the leaf nodes
5. Prune the tree to avoid over-fitting

Two phases: recursively grow the tree (steps 1-4), then prune it (step 5).
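A sketch of steps 1-4 (growing only, no pruning) on the cell data of slide 24. Information gain is used as the selection measure here as an illustrative assumption; the slide does not fix a particular measure:

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def grow_tree(rows, labels, attrs):
    """Steps 1-4: recursively grow an (unpruned) decision tree."""
    if len(set(labels)) == 1:                 # step 4: pure node -> leaf
        return labels[0]
    if not attrs:                             # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    def gain(a):                              # step 1: information gain of attribute a
        rem = sum(len(s) / len(labels) * entropy(s)
                  for v in set(r[a] for r in rows)
                  for s in [[l for r, l in zip(rows, labels) if r[a] == v]])
        return entropy(labels) - rem
    best = max(attrs, key=gain)
    rest = [a for a in attrs if a != best]
    return (best, {v: grow_tree([r for r in rows if r[best] == v],      # steps 2-3
                                [l for r, l in zip(rows, labels) if r[best] == v],
                                rest)
                   for v in set(r[best] for r in rows)})

def classify(tree, row):
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree

cells = [("light", 1, 1, "healthy"), ("dark", 1, 1, "healthy"),
         ("light", 1, 2, "healthy"), ("light", 2, 1, "healthy"),
         ("dark", 1, 2, "cancerous"), ("dark", 2, 1, "cancerous"),
         ("light", 2, 2, "cancerous"), ("dark", 2, 2, "cancerous")]
rows = [{"color": c, "nuclei": n, "tails": t} for c, n, t, _ in cells]
labels = [c[3] for c in cells]
tree = grow_tree(rows, labels, ["color", "nuclei", "tails"])
```

Since the eight training cells are consistent (no two identical cells with different labels), the grown tree classifies all of them correctly.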
Measures for Attribute Selection
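The body of this slide did not survive the transcript. As an assumption, the standard measures used by ID3/C4.5-style tree induction are entropy and information gain:

```latex
\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
\qquad
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) \;-\; \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
```

where p_i is the proportion of class i in S and S_v is the subset of S with value v for attribute A; the attribute with the largest gain is chosen in step 1.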
Other Classification Methods

Neural networks
Instance-based classification
Genetic algorithms
Rough set approach
Statistical approaches
Support vector machines
etc.
Mining with Neural Networks

[Figure: a neural network for the cell data, with inputs such as color = dark, #nuclei = 1, #tails = 2 and outputs Healthy / Cancerous]
Neural Networks

Advantages:
prediction accuracy is generally high
robust: works when training examples contain errors
output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
fast evaluation of the learned target function

Criticism:
long training time
the learned function (weights) is difficult to understand
domain knowledge is not easy to incorporate
Instance-based Classification

Instance-based classification uses the most similar individual instances known from the past to classify a new instance.

Typical approaches:
k-nearest neighbor: instances represented as points in a Euclidean space
Locally weighted regression: constructs a local approximation
Case-based reasoning: uses symbolic representations and knowledge-based inference
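A minimal k-nearest-neighbor sketch; the training points are hypothetical, loosely echoing the (income, debt) example of slide 4:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Majority vote among the k nearest training points.
    train is a list of (point, label) pairs; points are numeric tuples."""
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    nearest = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical (income, debt) points in US$K:
train = [((10, 9), "default"), ((17, 15), "default"), ((26, 8), "default"),
         ((41, 22), "good"), ((48, 28), "good"), ((59, 18), "good")]
print(knn_classify(train, (15, 12)))  # → default
```

There is no training phase: the stored instances themselves are the "model", which is why the method is also called lazy learning.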
Genetic Algorithms (GA)

GAs are based on an analogy to biological evolution:
Each rule is represented by a string of bits, e.g., "IF A1 and Not A2 then C2" can be encoded as 100
An initial population is created consisting of randomly generated rules
Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
The fitness of a rule is measured by its classification accuracy on a set of training examples
Offspring are generated by crossover and mutation
Rough Set Approach

Rough sets are used to approximately ("roughly") define equivalence classes. A rough set for a given class C is approximated by two sets:
a lower approximation (certain to be in C)
an upper approximation (possibly in C)

Uses: finding minimal subsets (reducts) of attributes, dependencies in data, rules, etc.

[Figure: a set X approximated from below and above by equivalence classes]

Rough Sets and Data Mining, T.Y. Lin, N. Cercone (eds.), Kluwer Academic Publishers, 1997
Rough Sets in Knowledge Discovery, L. Polkowski, A. Skowron (eds.), Physica-Verlag, 1998
Bayesian Classification

Calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of problems.

P(Ci|X) = probability that the instance X = <x1,…,xk> is of class Ci. Idea: assign to sample X the class label Ci such that P(Ci|X) is maximal.

Bayes' theorem:
P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Naïve assumption: attribute independence, so P(X|Ci) = P(x1|Ci) × … × P(xk|Ci)
A Bayesian belief network allows a subset of the variables to be conditionally independent.
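A minimal naïve Bayes sketch on the cell data of slide 24; Laplace smoothing is added as an assumption to avoid zero probabilities:

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count-based estimates of P(Ci) and P(xj | Ci)."""
    prior = Counter(labels)
    cond = defaultdict(Counter)     # (class, attribute) -> value counts
    domain = defaultdict(set)       # attribute -> observed values
    for row, c in zip(rows, labels):
        for a, v in row.items():
            cond[(c, a)][v] += 1
            domain[a].add(v)
    return prior, cond, domain

def predict_nb(model, row):
    """argmax_i P(Ci) * prod_j P(xj | Ci), with Laplace smoothing."""
    prior, cond, domain = model
    n = sum(prior.values())
    def score(c):
        p = prior[c] / n
        for a, v in row.items():
            p *= (cond[(c, a)][v] + 1) / (prior[c] + len(domain[a]))
        return p
    return max(prior, key=score)

rows = [{"color": "light", "nuclei": 1, "tails": 1},
        {"color": "dark",  "nuclei": 1, "tails": 1},
        {"color": "light", "nuclei": 1, "tails": 2},
        {"color": "light", "nuclei": 2, "tails": 1},
        {"color": "dark",  "nuclei": 1, "tails": 2},
        {"color": "dark",  "nuclei": 2, "tails": 1},
        {"color": "light", "nuclei": 2, "tails": 2},
        {"color": "dark",  "nuclei": 2, "tails": 2}]
labels = ["healthy"] * 4 + ["cancerous"] * 4
model = train_nb(rows, labels)
print(predict_nb(model, {"color": "dark", "nuclei": 2, "tails": 2}))  # → cancerous
```

Despite the independence assumption rarely holding exactly, the argmax is often correct, which is what makes the naïve classifier practical.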
Market Basket Analysis

Analyzes customer buying habits by finding associations between the different items that customers place in their "shopping baskets"
Helps develop marketing strategies by gaining insight into which items are frequently purchased together by customers
Example: how often do people buy onigiri and beer together?
Mining with Association Rules

Association: the presence of the same color and #nuclei implies the presence of the same #tails in the same record.

If color = light and #nuclei = 1, then #tails = 1 (support = 12.5%; confidence = 50%)
If #nuclei = 2 and cell = cancerous, then #tails = 2 (support = 25%; confidence = 100%)

Support: the proportion of times that the rule applies. Confidence: the proportion of times that the rule is correct. (Apriori algorithm, R. Agrawal, 1993)
Rule Measures: Support and Confidence

Example: find all rules X & Y ⇒ Z with minimum confidence and support
support s = probability that a transaction contains {X, Y, Z}
confidence c = conditional probability that a transaction containing {X, Y} also contains Z

Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

With minimum support 50% and minimum confidence 50%:
A ⇒ C (s = 50%, c = 66.6%)
C ⇒ A (s = 50%, c = 100%)

[Venn diagram: customers buying onigiri, customers buying beer, and customers buying both]
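The two measures can be computed directly from the four-transaction table above; a minimal sketch:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """support(lhs ∪ rhs) / support(lhs)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

baskets = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
print(support(baskets, {"A", "C"}))        # → 0.5
print(confidence(baskets, {"A"}, {"C"}))   # ≈ 0.667
print(confidence(baskets, {"C"}, {"A"}))   # → 1.0
```

This reproduces the slide's figures: A ⇒ C has s = 50%, c = 66.6%; C ⇒ A has s = 50%, c = 100%.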
Association Mining: Apriori Algorithm
It is composed of two steps:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a pre-determined minimum support count
2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence
Association Mining: the Apriori Principle

For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent.

Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

Frequent Itemset | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A, C} | 50%

(Min. support 50%, min. confidence 50%)
The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation

1. Find the frequent itemsets: the sets of items that have minimum support
   A subset of a frequent itemset must also be a frequent itemset (e.g., if {A, B} is frequent, both {A} and {B} must be frequent)
   Iteratively find frequent itemsets of cardinality 1 to k, alternating candidate sets and frequent sets: C1 → L1 → C2 → L2 → … → Lk
2. Use the frequent itemsets to generate association rules
Example (min_sup_count = 2)

Transactional data D:
TID | List of item IDs
T100 | I1, I2, I5
T200 | I2, I4
T300 | I2, I3
T400 | I1, I2, I4
T500 | I1, I3
T600 | I2, I3
T700 | I1, I3
T800 | I1, I2, I3, I5
T900 | I1, I2, I3

Scan D for the count of each candidate (C1), then compare with the minimum support count to get L1 (here C1 = L1, since every 1-itemset is frequent):
Itemset | Sup. count
{I1} | 6
{I2} | 7
{I3} | 6
{I4} | 2
{I5} | 2

Generate C2 candidates from L1 and scan D for their counts:
Itemset | Sup. count
{I1, I2} | 4
{I1, I3} | 4
{I1, I4} | 1
{I1, I5} | 2
{I2, I3} | 4
{I2, I4} | 2
{I2, I5} | 2
{I3, I4} | 0
{I3, I5} | 1
{I4, I5} | 0

Keeping those with count ≥ 2 gives L2: {I1, I2}, {I1, I3}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}

Generate C3 candidates from L2 and scan D:
Itemset | Sup. count
{I1, I2, I3} | 2
{I1, I2, I5} | 2

Both meet the minimum support count, so L3 = C3.
Mining with Clustering

Clustering analyzes data objects without consulting a known class label.
The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.
Partition-based clustering suits large sets of numerical data; hierarchical clustering, with at least O(n²) time complexity, seems not suitable for very large datasets.
What is Cluster Analysis?

A cluster is a collection of data objects such that:
objects in the cluster are similar to one another
objects in the cluster are dissimilar to the objects in other clusters
The process of grouping objects into clusters is called clustering.
Clustering in Different Fields

Statistics: for many years, a focus on distance-based clustering (S-Plus, SPSS, SAS)
Machine learning: unsupervised learning; in conceptual clustering, a group of objects forms a class only if it is described by a concept
KDD: efficient and effective clustering of large databases: scalability, complex shapes and types of data, high-dimensional clustering, mixed numerical and categorical data
What is Good Clustering?

A good clustering method produces high-quality clusters with:
high intra-class similarity (within a class)
low inter-class similarity (between classes)
The quality of clustering basically depends on the similarity measure and the cluster representative used by the method.
Typical Requirements of Clustering

Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noisy data
Insensitivity to the order of input records
High dimensionality
Constraint-based clustering
Interpretability and usability
Clustering Methods in KDD
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Methods
Partitioning Methods

Given n objects and k, the number of clusters to form, a partitioning algorithm organizes the objects into a partition of k clusters.
The clusters are formed to optimize an objective partitioning criterion so that the objects within a cluster are "similar", whereas the objects of different clusters are "dissimilar".
K-means Algorithm (K = 2)

1. Select two centers randomly from the n objects
2. Form two clusters by assigning each object to its nearest center
3. Calculate two new centers (cluster means)
4. Re-form the two clusters and repeat steps 2 and 3 until the stopping conditions hold
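The four steps can be sketched as Lloyd's iteration; the `init` parameter is an addition here, for deterministic illustration in place of the random selection of step 1:

```python
import random

def kmeans(points, k, iters=100, init=None):
    """Assign each point to its nearest center, recompute centers as
    cluster means, and stop when the centers no longer change."""
    centers = list(init) if init else random.sample(points, k)     # step 1
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                                           # step 2
            j = min(range(k), key=lambda j: sum((a - b) ** 2
                                               for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        new = [tuple(sum(cs) / len(cl) for cs in zip(*cl)) if cl else centers[j]
               for j, cl in enumerate(clusters)]                   # step 3
        if new == centers:                                         # stopping condition
            break
        centers = new
    return centers, clusters

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers, clusters = kmeans(pts, 2, init=[(0, 0), (10, 10)])
print(centers)  # → [(0.5, 0.5), (10.5, 10.5)]
```

With two well-separated blobs the iteration converges after one re-centering; as the next slide notes, the result is sensitive to outliers because the mean itself is.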
Partitioning Methods

The k-means algorithm is sensitive to outliers.
The k-medoids method uses medoids (the most centrally located object in a cluster): PAM (Partitioning Around Medoids); from k-medoids to CLARA (Clustering LARge Applications); from CLARA to CLARANS (Clustering LARge Applications based on RANdomized Search).
The EM (Expectation-Maximization) algorithm assigns an object to a cluster according to a weight representing its probability of membership.
Hierarchical Methods

A hierarchical clustering is a sequence of partitions in which each partition is nested into the next (previous) partition in the sequence.
Partition Q is nested into partition P if every component of Q is a subset of a component of P. For example:

P = {{x1, x4, x7, x9, x10}, {x2, x8}, {x3, x5, x6}}
Q = {{x1, x4, x9}, {x7, x10}, {x2, x8}, {x5}, {x3, x6}}
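The nesting test is a one-liner; using the two partitions above (with each x_i written as a plain integer):

```python
def is_nested(Q, P):
    """True if every component of Q is a subset of some component of P."""
    return all(any(q <= p for p in P) for q in Q)

P = [{1, 4, 7, 9, 10}, {2, 8}, {3, 5, 6}]
Q = [{1, 4, 9}, {7, 10}, {2, 8}, {5}, {3, 6}]
print(is_nested(Q, P))  # → True
print(is_nested(P, Q))  # → False
```

A hierarchical clustering is then a sequence of partitions in which each is nested into its successor.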
Hierarchical Clustering: Chameleon
Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling. Clusters are merged if the interconnectivity and closeness between them are highly related to the internal interconnectivity and closeness of objects within the clusters.
Density-based Methods

These typically regard clusters as dense regions of objects in the data space, separated by regions of low density.
DBSCAN: based on connected regions with sufficiently high density (nearest-neighbor estimation)
DENCLUE: based on density distribution functions (kernel estimation)

[Figure: DBSCAN results for dataset DS2 with MinPts = 4 and Eps = (a) 5.0, (b) 3.5, and (c) 3.0]
Data and Knowledge Visualization

Techniques: tree maps, cone trees, fisheye views, hyperbolic trees, Magic Lens

[Figure: example visualizations, with labels such as "Sunday 11-12 PM" and "Lunch time"]
KDD Products and Tools

SPSS, IBM, Silicon Graphics, SAS, Salford Systems, RuleQuest Research (C4.5), …
Outline

Basic concepts of KDD
KDD techniques: classification, association, clustering, visualization
Challenges and trends in KDD
KDD and high performance computing
Case studies in medical data mining
Challenges of KDD

Different types of data in different forms (mixed numeric, symbolic, text, image, voice, …) [problems: quality, effectiveness?]
Large data sets (10^6-10^12 bytes) and high dimensionality (10^2-10^3 attributes) [problems: efficiency, scalability?]
Data and knowledge are changing
Human-computer interaction and visualization
Large Datasets and High Dimensionality

3 attributes, each with 2 values: #instances = 2^3 = 8, #patterns = 27 (= (2+1)^3, allowing a "don't care" per attribute)
What if the number of attributes increases? The sizes of the instance space and pattern space grow exponentially:
with p attributes, each having d values, the size of the instance space is d^p
38 attributes, each with 10 values: #instances = 10^38
Possible Solutions

Scalable and efficient algorithms (scalable: given an amount of main memory, the runtime increases linearly with the number of input instances)
Sampling (instance selection)
Dimensionality reduction (feature selection)
Approximation methods
Massively parallel processing
Integration of machine learning and database management
Numerical vs. Symbolic Data

Attribute type | Structure | Operations | Examples
Nominal (categorical) | no structure | =, ≠ | places, color
Ordinal | ordinal structure | =, ≠, ≥ | rank, resemblance
Measurable (numerical) | ring structure | =, ≠, ≥, +, × | age, temperature, taste, income, length

Symbolic data: combinatorial search in hypothesis spaces (machine learning)
Numerical data: often matrix-based computation (multivariate data analysis)
Issues of Decision Tree Mining

Attribute selection
Pruning trees
From trees to rules (high cost of pruning)
Visualization
Data access: recent development on very large training sets; fast, efficient, and scalable algorithms (in-memory and secondary storage)

(Well-known systems: C4.5 and CART)
Scalable Decision Tree Induction Methods

SLIQ (Mehta et al., 1996): builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (J. Shafer et al., 1996): constructs an attribute-list data structure
PUBLIC (Rastogi & Shim, 1998): integrates tree splitting and tree pruning to stop growing the tree earlier
RainForest (Gehrke, Ramakrishnan & Ganti, 1998): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)
Issues of Neural Network Mining

Neural networks effectively address the weakness of the symbolic AI approach to knowledge discovery (growth of the hypothesis space)
Extracting or making sense of the numeric weights associated with the interconnections of neurons, to come up with a higher level of knowledge, has been and will continue to be a challenging problem
Issues of Association Rule Mining

Improving efficiency: database scan reduction by partitioning (Savasere 95), hashing (Park 95), sampling (Toivonen 96), dynamic itemset counting (Brin 97), finding non-redundant rules (3000 times fewer, Zaki, KDD 2000)
Parallel mining of association rules
New measures of association: interestingness and exceptional rules
Generalized and multiple-level rules
Mining Scientific Data

Data mining in bioinformatics
Data mining in astronomy and the earth sciences
Mining physics and chemistry data
Mining large image databases
etc.
Some Advanced Techniques
Support Vector Machines
Independent Component Analysis
Level Sets and Data Mining
Multi-Relational Data Mining and Logic Programming
Ensemble Methods
Distributed and High Performance Computing
etc.
Outline

Basic concepts of KDD
KDD techniques: classification, association, clustering, visualization
Challenges and trends in KDD
KDD and high performance computing
Case studies in medical data mining
Possible Solutions

Scalable and efficient algorithms (scalable: given an amount of main memory, the runtime increases linearly with the number of input instances)
Massively parallel processing: data-parallel vs. control-parallel data mining
Client/server frameworks for parallel data mining

Mining Very Large Databases with Parallel Processing, Alex A. Freitas & Simon H. Lavington, Kluwer Academic Publishers, 1998
Example of a Scalable Algorithm: Mixed Similarity Measure

Mixed similarity measures (MSM): Goodall (1966), time O(n^3); Diday and Gowda (1992); Ichino and Yaguchi (1994); Li & Biswas (1997), time O(n^2 log n^2), space O(n^2)
New and efficient MSM (Binh & Bao, 2000): time and space O(n), with the similarity computed as P̂_ij = 1 − P̂*_ij
Comparative Results

US Census database, 33 symbolic + 8 numeric attributes; Alpha 21264, 500 MHz, 2 GB RAM, Solaris OS (Nguyen N.B. & Ho T.B., PKDD 2000)

#cases (size) | 500 (0.2M) | 1,000 (0.5M) | 1,500 (0.9M) | 2,000 (1.1M) | 5,000 (2.6M) | 10,000 (5.2M) | 199,523 (102M)
#values | 497 | 992 | 1,486 | 1,973 | 4,858 | 9,651 | 97,799
Time, LiBis O(n^2 log n^2) | 67.3s | 26m6.2s | 1h46m31s | 6h59m45s | >60h | n/a | n/a
Time, ours O(n) | 0.1s | 0.2s | 0.3s | 0.5s | 2.8s | 9.2s | 36m26s
Memory, LiBis O(n^2) | 5.3M | 20.0M | 44.0M | 77.0M | 455.0M | n/a | n/a
Memory, ours O(n) | 0.5M | 0.7M | 0.9M | 1.1M | 2.1M | 3.4M | 64.0M
Preprocessing | 0.1s | 0.1s | 0.2s | 0.5s | 0.9s | 6.2s | 127.2s
Approaches of High Performance Computing to Data Mining

Data-oriented approaches: discretization; attribute selection; instance selection (sampling: single sampling, iterative sampling)
Algorithm-oriented approaches: fast algorithms (restricted search, algorithm optimization); distributed mining (voting, model integration, meta-learning); parallel mining (inter-processor cooperation, inter-algorithm parallelization)
Distributed & Parallel Data Mining

Distributed system: the data set to be mined is split into subsets 1…P; an algorithm runs on each subset to produce local knowledge, and the pieces of knowledge are combined.
Parallel system: several algorithms run in parallel on the data set to be mined, each producing knowledge that is then combined.
Parallel Data Mining

Methods: rule induction, decision trees, neural networks, genetic algorithms, rough sets, association rules, clustering, etc.
1. Parallel data mining without DBMS facilities
2. Parallel data mining with database facilities

Exploiting data parallelism in instance-based learning: the stored cases are split into subsets 1…p, one per processor; each processor finds its local nearest case to the new case (local MIN), and a global MIN over the local results yields the nearest case.
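The data-parallel scheme above can be sketched with a thread pool standing in for the p processors (the subset split and the local/global MIN structure follow the slide; the thread pool itself is an illustrative stand-in):

```python
from concurrent.futures import ThreadPoolExecutor
import math

def local_min(subset, query):
    """Each worker returns its nearest stored case (local MIN)."""
    return min(subset, key=lambda c: math.dist(c, query))

def parallel_nearest(stored_cases, query, p=4):
    """Split the stored cases into p subsets, compute local MINs in
    parallel, then take the global MIN over the local results.
    Assumes p <= len(stored_cases) so every subset is non-empty."""
    chunks = [stored_cases[i::p] for i in range(p)]
    with ThreadPoolExecutor(max_workers=p) as ex:
        local_results = list(ex.map(lambda s: local_min(s, query), chunks))
    return min(local_results, key=lambda c: math.dist(c, query))

stored = [(0, 0), (5, 5), (2, 2), (9, 9), (1, 1), (7, 7), (3, 3), (8, 8)]
print(parallel_nearest(stored, (2.4, 2.4), p=4))  # → (2, 2)
```

The distance scan dominates the cost, so splitting the cases across p workers divides the per-worker work by p while the final global MIN compares only p candidates.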
Outline

Basic concepts of KDD
KDD techniques: classification, association, clustering, visualization
Challenges and trends in KDD
KDD and high performance computing
Case studies in medical data mining
Mining Stomach Cancer Data

Each year about 50,000 people die of stomach cancer in Japan. Data mining methods are expected to find new and useful knowledge.
The project started in summer 1999 and includes three data mining groups and doctors at the National Cancer Center in Tokyo.
The stomach cancer database was collected over 40 years (1962-1991). The transformed data table contains data on 6,712 patients described by 83 numeric and categorical attributes.
Overview of Our Data Mining Work

1. Understand the domain and define problems: use pre-operative data to predict the patient's stage after the operation; outcome classes: alive (3275), death after 5 years (575), death after 90 days (2552), death within 90 days (302), unknown (8)
2. Preprocess data: transform the data by converting categorical many-valued attributes (280) into binary attributes; construct the target attribute; select 31 significant attributes by the KJ and SFG methods
3. Data mining, extract patterns/models: learn decision trees by See5 and CABRO with tree visualization; learn prediction rules by CBA, Rosetta, and our method LUPC
4. Interpret and evaluate the discovered knowledge: meet with medical experts every two months to evaluate the results; scores (1-5) are given to the "acceptability", "novelty", and "utility" of discovered patterns
5. Put the results to practical use: data mining and evaluation are off-line
Learned Decision Trees with CABRO

Tightly-coupled views; T2.5D views (Trees 2.5 Dimensions); induced decision trees with graphical representation (easy to observe and interpret)
Learned Rules and Expert Evaluation

Some discovered rules (each scored 1-5 for acceptability, novelty, and utility):

IF dcancer = S AND serosal = 3 AND peritoneal = 0 AND apnemia = 0 THEN death < 90 days
IF dcancer = x AND type = B3 AND peritoneal = 0 AND liver_metastasis = 3 THEN death < 90 days
IF sex = M AND age < 73 AND liver_metastasis = 3 AND cardio = 1 THEN death < 90 days

Most rules found are not new to medical experts. There is a very high false-negative error in the (minority) target class.
User-centered Data Mining

Active participation of the user (domain experts) in the KDD process and model selection
Putting visualization power into the KDD process
Putting domain knowledge into mining
PDCAT 2002, T.B. Ho 84
Visualization in the KDD Process
Synergistic visualization of data and knowledge within the knowledge discovery context
Appropriate interactive visualization techniques in the knowledge discovery process
PDCAT 2002, T.B. Ho 85
Significant Hypothesis Detected by Visualization
Some instances in the class "alive" have metastasis = 3
PDCAT 2002, T.B. Ho 86
Putting Domain Knowledge in Mining
Exclusive constraints: if imposed, D2MS finds only rules whose condition parts contain none of the specified constraints (attribute-value pairs).
Inclusive constraints: if imposed, D2MS finds only rules whose condition parts contain at least one of the specified constraints (attribute-value pairs).
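The two constraint types can be read as a simple filter over candidate rules. A minimal sketch (not D2MS's actual implementation — rules are modeled as sets of attribute-value pairs for illustration):

```python
def satisfies(rule_conditions, exclusive=(), inclusive=()):
    """Keep a rule only if its condition part contains none of the
    exclusive attribute-value pairs and, when inclusive constraints
    are given, at least one of them."""
    conds = set(rule_conditions)
    if conds & set(exclusive):          # exclusive: forbid any overlap
        return False
    if inclusive and not (conds & set(inclusive)):  # inclusive: require some overlap
        return False
    return True

rule = [("liver_metastasis", "3"), ("sex", "M")]
assert not satisfies(rule, exclusive=[("liver_metastasis", "3")])
assert satisfies(rule, inclusive=[("liver_metastasis", "3")])
```

Exclusive constraints thus steer the search toward "irregular" rules that avoid the dominant attributes, while inclusive constraints force rules to mention a rare symptom, as on the next slide.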
PDCAT 2002, T.B. Ho 87
Putting Domain Knowledge in Mining
Finding irregular rules
Using exclusive constraints, find only rules for the class "death within 90 days" that do not contain the characteristic attribute "liver_metastasis" and/or its combination with two other typical attributes, "Peritoneal_metastasis" and "Serosal_invasion".

Rule 8: acc = 1.0 (4/4), cover = 0.001 (4/6712)
IF category = R AND sex = F AND proximal_third = 3 AND middle_third = 1
THEN death within 90 days

Finding rare rules
Using inclusive constraints, find rules in the class "alive" that contain the symptom "liver_metastasis".

Rule 1: acc = 0.500 (2/4); cover = 0.001 (4/6712)
IF sex = M AND type = B1 AND liver_metastasis = 3 AND middle_third = 1
THEN class = alive
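The acc and cover figures attached to each rule follow the usual definitions: accuracy is the fraction of covered examples that belong to the target class, coverage the fraction of the whole database the rule covers. A small sketch (function name is ours, not from D2MS):

```python
def rule_metrics(covered, correct_in_covered, total):
    """acc  = fraction of covered examples in the target class
       cover = fraction of the whole database the rule covers"""
    acc = correct_in_covered / covered
    cover = covered / total
    return acc, cover

# Rule 1 above: 4 covered patients, 2 of them in class "alive",
# out of 6,712 patients in the database
acc, cover = rule_metrics(4, 2, 6712)
# acc = 0.5; cover rounds to 0.001 as reported on the slide
```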
PDCAT 2002, T.B. Ho 88
Mining Hepatitis Data with Temporal Abstraction
The hepatitis relational database was collected during 1982-2001 at the Chiba University Hospital.
Our process of mining hepatitis data with temporal abstraction goes through six steps
PDCAT 2002, T.B. Ho 89
Temporal Abstraction Problems & Data Analysis
Structure and problems of temporal abstraction
A basic temporal abstraction has the structure <episode, state & trend>, for example <ALB 3 months, low & decreasing>. The problems are finding episodes, states, and trends.
For example, when visualizing the relation between GOT, GPT, TTT, ZTT and fibrosis stages of one patient during 1985-1993, we observed that the values of GOT, GPT, TTT, and ZTT decrease when fibrosis becomes less severe.
Analysis of data by statistics and visualization tools
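The <episode, state & trend> abstraction can be sketched in a few lines: take the values of one lab test within an episode, derive the state from a normal range and the trend from the overall change. This is a deliberately simplified illustration, not the project's actual abstraction method, and the ALB normal range used is an assumption:

```python
def abstract_episode(values, low, high):
    """Abstract one episode of a lab-test series into (state, trend):
    state from the normal range [low, high], trend from the sign of
    the overall change across the episode."""
    mean = sum(values) / len(values)
    state = "low" if mean < low else "high" if mean > high else "normal"
    delta = values[-1] - values[0]
    trend = ("decreasing" if delta < 0
             else "increasing" if delta > 0
             else "stable")
    return state, trend

# ALB over a 3-month episode, assuming a normal range of 3.9-5.0 g/dL
abstract_episode([3.5, 3.2, 3.0], low=3.9, high=5.0)
# -> ("low", "decreasing"), i.e. the slide's <ALB 3 months, low & decreasing>
```

Running this over every test and episode length turns the temporal database into the symbolic flat table described on the next slide.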
PDCAT 2002, T.B. Ho 90
Abstracted Data and Primary Results
From the relational and temporal database, we derived abstracted descriptions and converted them into symbolic data in flat data tables.
Using the system D2MS, we found different rule sets and decision trees for distinguishing hepatitis B and C, as well as the fibrosis stages.
Most rules for hepatitis B and C match from 2% to 5% of the database with high accuracy. The accuracy under 10-fold cross-validation is somewhat higher than 70%.
The patient in the first row has abstractions on "ALB 3 months" as "normal & decreasing-decreasing" (N-DD), on "ALB 6 months" as "normal & decreasing-stable" (N-DS), etc.
(Figure: original data, abstracted data, and extracted rules)
PDCAT 2002, T.B. Ho 91
Rules Contradicting Human Belief
Short-term change: GOT (up), GPT (up), TTT (up), ZTT (up).
Long-term change: T-CHO (down), CHE (down), ALB (down), TP (down), PLT (down), WBC (down), HGB (down), T-BIL (up), D-BIL (up), I-BIL (up), ICG-15 (up).
Many rules found contradict human belief, for example:

Rule 2: accuracy = 1.0 (12/12); coverage = 0.028 (12/426)
IF ALB2 = normal & decreasing-decreasing
AND GOT4 = normal & decreasing-decreasing
AND TTT4 = normal & decreasing-decreasing
THEN class = fibrosis stage F1
PDCAT 2002, T.B. Ho 92
Rules Characterizing HBV and HCV
Example of a rule for hepatitis C
The rules show the difference in temporal patterns between HBV and HCV.
PDCAT 2002, T.B. Ho 93
Rules Characterizing Fibrosis Stages
Example of a rule characterizing fibrosis stage F4.
The rules show the difference in temporal patterns between the fibrosis stages F0, F1, …, F4
PDCAT 2002, T.B. Ho 94
Summary
KDD concepts, methods, challenges, examples
KDD is a new, fast-growing interdisciplinary field for both research and application
Speeding up KDD algorithms is crucial
PDCAT 2002, T.B. Ho 95
Recommended References
http://www.kdnuggets.com
David J. Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining, MIT Press, 2000
Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000
U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.