EECS 800 Research Seminar: Mining Biological Data
DESCRIPTION
EECS 800 Research Seminar: Mining Biological Data. Instructor: Luke Huan. Fall 2006. Overview: rule-based classification method overview; CBA: classification based on association; applying rule-based methods in microarray analysis; rule-based methods vs. SVM.

TRANSCRIPT
The UNIVERSITY of Kansas
EECS 800 Research Seminar: Mining Biological Data
Instructor: Luke Huan
Fall, 2006
Mining Biological Data, KU EECS 800, Luke Huan, Fall '06 (10/25/2006, Classification III)
Overview
Rule-based classification method overview
CBA: classification based on association
Applying rules based method in Microarray analysis
Rule-Based Methods vs. SVM
Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering Rule Groups for Gene Expression Data". SIGMOD'05, 2005.
Rule-Based Classifier
Classify records by using a collection of “if…then…” rules
Rule: (Condition) → y, where
Condition is a conjunction of attribute tests
y is the class label
Examples of classification rules:
(Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
(Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
Rule-based Classifier (Example)
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds
Application of Rule-Based Classifier
A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
The rule R1 covers the hawk ⇒ Birds
The rule R3 covers the grizzly bear ⇒ Mammals
Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk          warm        no          yes      no             ?
grizzly bear  warm        yes         no       no             ?
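The covering test can be sketched in a few lines of Python (a hypothetical helper, not code from the slides): a rule covers an instance exactly when every attribute test in its condition is satisfied.

```python
# Minimal sketch: a rule "covers" an instance when every attribute test
# in its condition is satisfied by the instance's attribute values.

def covers(rule_condition, instance):
    """rule_condition: dict mapping attribute -> required value."""
    return all(instance.get(attr) == val for attr, val in rule_condition.items())

# R1: (Give Birth = no) AND (Can Fly = yes) -> Birds
r1 = {"Give Birth": "no", "Can Fly": "yes"}
hawk = {"Blood Type": "warm", "Give Birth": "no",
        "Can Fly": "yes", "Live in Water": "no"}

print(covers(r1, hawk))  # True -> the hawk is classified as a bird
```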
Rule Coverage and Accuracy
Coverage of a rule: fraction of records that satisfy the antecedent of the rule
Accuracy of a rule: fraction of covered records that satisfy both the antecedent and the consequent of the rule
(Status=Single) → No
Coverage = 40%, Accuracy = 50%
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
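The coverage and accuracy numbers above can be verified directly. This sketch (not from the slides) encodes the table's (Marital Status, Cheat) pairs and evaluates the rule (Status=Single) → No:

```python
# Compute coverage and accuracy of (Status=Single) -> Cheat=No
# over the 10 (Marital Status, Cheat) pairs from the table above.

records = [
    ("Single", "No"), ("Married", "No"), ("Single", "No"), ("Married", "No"),
    ("Divorced", "Yes"), ("Married", "No"), ("Divorced", "No"), ("Single", "Yes"),
    ("Married", "No"), ("Single", "Yes"),
]

covered = [r for r in records if r[0] == "Single"]   # antecedent holds
correct = [r for r in covered if r[1] == "No"]       # consequent also holds

coverage = len(covered) / len(records)   # 4/10 = 0.4
accuracy = len(correct) / len(covered)   # 2/4  = 0.5
print(coverage, accuracy)  # 0.4 0.5
```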
How does Rule-based Classifier Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
A lemur triggers rule R3, so it is classified as a mammal
A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
lemur          warm        yes         no       no             ?
turtle         cold        no          no       sometimes      ?
dogfish shark  cold        yes         no       yes            ?
Characteristics of Rule-Based Classifier
Mutually exclusive rules: every record is covered by at most one rule
Exhaustive rules: the classifier has exhaustive coverage if it accounts for every possible combination of attribute values
Each record is covered by at least one rule
From Decision Trees To Rules
[Decision tree: root Refund? Yes → No; No → Marital Status: {Married} → No; {Single, Divorced} → Taxable Income: < 80K → No, > 80K → Yes.]

Classification Rules
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single, Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single, Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
Rules are mutually exclusive and exhaustive
Rule set contains as much information as the tree
Rules Can Be Simplified
[Same decision tree as above: root Refund? Yes → No; No → Marital Status: {Married} → No; {Single, Divorced} → Taxable Income: < 80K → No, > 80K → Yes.]
Initial Rule: (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Effect of Rule Simplification
Rules are no longer mutually exclusive: a record may trigger more than one rule
Solution?
Ordered rule set
Unordered rule set – use voting schemes
Rules are no longer exhaustive: a record may not trigger any rule
Solution?
Use a default class
Ordered Rule Set
Rules are rank-ordered according to their priority; an ordered rule set is known as a decision list
When a test record is presented to the classifier, it is assigned the class label of the highest-ranked rule it triggers
If none of the rules fire, it is assigned to the default class
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name    Blood Type  Give Birth  Can Fly  Live in Water  Class
turtle  cold        no          no       sometimes      ?
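A decision list can be sketched as follows (assumptions: the rule order R1–R5 from the slide, and a made-up "Unknown" default class). The turtle triggers both R4 and R5, and the ordered scan resolves the conflict in favor of R4:

```python
# Sketch of an ordered rule set (decision list): return the class of the
# first rule that fires; otherwise fall back to a default class.

RULES = [
    ({"Give Birth": "no", "Can Fly": "yes"}, "Birds"),          # R1
    ({"Give Birth": "no", "Live in Water": "yes"}, "Fishes"),   # R2
    ({"Give Birth": "yes", "Blood Type": "warm"}, "Mammals"),   # R3
    ({"Give Birth": "no", "Can Fly": "no"}, "Reptiles"),        # R4
    ({"Live in Water": "sometimes"}, "Amphibians"),             # R5
]

def classify(instance, rules=RULES, default="Unknown"):
    for condition, label in rules:
        if all(instance.get(a) == v for a, v in condition.items()):
            return label
    return default  # no rule fired -> default class

turtle = {"Blood Type": "cold", "Give Birth": "no", "Can Fly": "no",
          "Live in Water": "sometimes"}
print(classify(turtle))  # Reptiles: R4 fires before R5 in the ordered list
```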
Rule Ordering Schemes
Rule-based ordering: individual rules are ranked based on their quality
Class-based ordering: rules that belong to the same class appear together
Rule-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single, Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single, Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
Class-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single, Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Married}) ==> No
(Refund=No, Marital Status={Single, Divorced}, Taxable Income>80K) ==> Yes
Building Classification Rules
Direct Method: Extract rules directly from data
e.g.: CBA
Indirect Method: Extract rules from other classification models (e.g., decision trees, neural networks)
e.g., C4.5rules
Direct Method: Sequential Covering
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Steps (2) and (3) until the stopping criterion is met
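The four steps above can be sketched as a loop. This is a hedged illustration, not RIPPER itself: the `learn_one_rule` here is a deliberately simple stand-in that greedily picks the single best attribute-value test, whereas real systems grow multi-conjunct rules.

```python
# Skeleton of sequential covering. Assumption: learn_one_rule greedily
# picks the single attribute=value test with the highest accuracy on the
# remaining data (real learners grow rules conjunct by conjunct).

def learn_one_rule(data, target):
    best = None
    for attr in data[0][0]:
        for val in {x[0][attr] for x in data}:
            cov = [x for x in data if x[0][attr] == val]
            acc = sum(1 for x in cov if x[1] == target) / len(cov)
            if best is None or acc > best[1]:
                best = ({attr: val}, acc)
    return best[0]

def sequential_covering(data, target, max_rules=5):
    rules, remaining = [], list(data)
    # Step 1: start from an empty rule set
    while any(y == target for _, y in remaining) and len(rules) < max_rules:
        cond = learn_one_rule(remaining, target)   # Step 2: grow a rule
        rules.append((cond, target))
        # Step 3: remove training records covered by the new rule
        remaining = [x for x in remaining
                     if not all(x[0][a] == v for a, v in cond.items())]
    return rules                                    # Step 4: loop until done

data = [({"Fly": "yes"}, "Bird"), ({"Fly": "no"}, "Mammal"),
        ({"Fly": "yes"}, "Bird")]
print(sequential_covering(data, "Bird"))  # [({'Fly': 'yes'}, 'Bird')]
```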
Example of Sequential Covering
(i) Original Data (ii) Step 1
Example of Sequential Covering…
(iii) Step 2: rule R1 learned  (iv) Step 3: rules R1 and R2 learned
Aspects of Sequential Covering
Rule Growing
Rule Evaluation
Stopping Criterion
Rule Pruning
Rule Growing
Two common strategies
(a) General-to-specific: start from the empty rule { } (Yes: 3, No: 4) and grow it by adding candidate conjuncts such as Refund=No, Status=Single, Status=Divorced, Status=Married, or Income > 80K, keeping the conjunct with the best class distribution
(b) Specific-to-general: start from specific rules such as (Refund=No, Status=Single, Income=85K) → Class=Yes and (Refund=No, Status=Single, Income=90K) → Class=Yes, and generalize them to (Refund=No, Status=Single) → Class=Yes
Rule Growing (Examples)
RIPPER algorithm: start from an empty rule {} => class, and add conjuncts that maximize FOIL's information gain measure:
R0: {} => class (initial rule)
R1: {A} => class (rule after adding a conjunct)
Gain(R0, R1) = t × [ log2(p1 / (p1 + n1)) − log2(p0 / (p0 + n0)) ]
where
t: number of positive instances covered by both R0 and R1
p0: number of positive instances covered by R0
n0: number of negative instances covered by R0
p1: number of positive instances covered by R1
n1: number of negative instances covered by R1
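The gain formula transcribes directly into code. The counts in the example call are made up for illustration:

```python
# FOIL's information gain for a candidate refinement R0 -> R1,
# transcribed from the formula above (base-2 logarithms).
import math

def foil_gain(p0, n0, p1, n1, t):
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Hypothetical counts: R0 covers 100 positives / 400 negatives; after
# adding a conjunct, R1 covers 30 positives / 10 negatives, and all 30
# positives of R1 are also covered by R0 (t = 30).
print(foil_gain(p0=100, n0=400, p1=30, n1=10, t=30))  # ≈ 57.2
```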
Rule Evaluation
Metrics (for a rule covering n instances, nc of which belong to the class the rule predicts; k classes; p the prior probability of the class):
Accuracy = nc / n
Laplace = (nc + 1) / (n + k)
M-estimate = (nc + k·p) / (n + k)
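The three metrics are one-liners; here they are evaluated for a toy rule (the counts are illustrative, not from the slides):

```python
# The three rule-evaluation metrics above, for a rule covering n instances,
# nc of which belong to the predicted class, with k classes and prior p.

def accuracy(nc, n):
    return nc / n

def laplace(nc, n, k):
    return (nc + 1) / (n + k)

def m_estimate(nc, n, k, p):
    return (nc + k * p) / (n + k)

# A rule covering 3 instances, all correct, in a 2-class problem:
print(accuracy(3, 3))                 # 1.0
print(laplace(3, 3, k=2))             # 4/5 = 0.8
print(m_estimate(3, 3, k=2, p=0.5))   # 4/5 = 0.8
```

Note how Laplace and the m-estimate pull the optimistic accuracy of 1.0 toward the prior, penalizing rules with tiny coverage.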
Stopping Criterion and Rule Pruning
Stopping criterion: compute the gain; if the gain is not significant, discard the new rule
Rule pruning: similar to post-pruning of decision trees. Reduced error pruning:
Remove one of the conjuncts in the rule, compare the error rate on a validation set before and after pruning, and prune the conjunct if the error improves
Summary of Direct Method
Grow a single rule
Remove instances covered by the rule
Prune the rule (if necessary)
Add rule to Current Rule Set
Repeat
Direct Method: RIPPER
For a 2-class problem, choose one of the classes as the positive class and the other as the negative class
Learn rules for the positive class
The negative class becomes the default class
For a multi-class problem, order the classes by increasing class prevalence (fraction of instances that belong to a particular class)
Learn the rule set for the smallest class first, treating the rest as the negative class
Repeat with the next smallest class as the positive class
Indirect Methods
Rule Set
r1: (P=No, Q=No) ==> -
r2: (P=No, Q=Yes) ==> +
r3: (P=Yes, R=No) ==> +
r4: (P=Yes, R=Yes, Q=No) ==> -
r5: (P=Yes, R=Yes, Q=Yes) ==> +
[Decision tree over attributes P, Q, and R whose leaf labels (+/-) correspond to the rule set above.]
Indirect Method: C4.5rules
Extract rules from an unpruned decision tree. For each rule r: A → y,
consider an alternative rule r′: A′ → y, where A′ is obtained by removing one of the conjuncts in A. Compare the pessimistic error rate of r against all the r′. Prune r if one of the r′ has a lower pessimistic error rate. Repeat until the generalization error can no longer be improved.
Advantages of Rule-Based Classifiers
As highly expressive as decision trees
Easy to interpret
Easy to generate
Can classify new instances rapidly
Performance comparable to decision trees
Overview of CBA
Classification rule mining versus association rule mining:
Aim: a small set of rules as a classifier, versus all rules satisfying minsup and minconf
Syntax: X → y (fixed class consequent), versus X → Y
Association Rules for Classification
Classification: mine a small set of rules existing in the data to form a classifier or predictor.
It has a target attribute: the class attribute
Association rules have no fixed target, but we can fix one. Class association rules (CARs) have a target class attribute. E.g.,
Own_house = true → Class = Yes [sup = 6/15, conf = 6/6]
CARs can obviously be used for classification.
B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. KDD'98. http://www.comp.nus.edu.sg/~dm2/
Decision tree vs. CARs
The decision tree below generates the following 3 rules:
Own_house = true → Class = Yes [sup = 6/15, conf = 6/6]
Own_house = false, Has_job = true → Class = Yes [sup = 5/15, conf = 5/5]
Own_house = false, Has_job = false → Class = No [sup = 4/15, conf = 4/4]
But there are many other rules that are not found by the decision tree
There are many more rules
CAR mining finds all of them. In many cases, rules not in the decision tree (or a rule list) may perform classification better. Such rules may also be actionable in practice
Decision tree vs. CARs (cont…)
Association mining requires discrete attributes, while decision tree learning uses both discrete and continuous attributes
CAR mining requires continuous attributes to be discretized; several such algorithms exist
A decision tree is not constrained by minsup or minconf, and thus can find rules with very low support; of course, such rules may be pruned to avoid overfitting
CBA: Three Steps
Discretize continuous attributes, if any
Generate all class association rules (CARs)
Build a classifier from the generated CARs
RG: The Algorithm
Find the complete set of all possible rules
This usually takes a long time to finish
RG: Basic Concepts
Frequent ruleitem: a ruleitem is frequent if its support is above minsup
Accurate rule: a rule is accurate if its confidence is above minconf
Possible rule: among all ruleitems that share the same condset, the ruleitem with the highest confidence is the possible rule of that set of ruleitems
The set of class association rules (CARs) consists of all possible rules (PRs) that are both frequent and accurate
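The CAR filter above can be sketched as follows. The ruleitem representation and the example numbers are assumptions for illustration, not the CBA implementation:

```python
# Sketch of the CAR selection above: for each condset keep only its
# highest-confidence ruleitem (the "possible rule"), then keep possible
# rules that are both frequent (minsup) and accurate (minconf).

def cars(ruleitems, minsup, minconf):
    """ruleitems: list of (condset, class_label, support, confidence)."""
    best = {}
    for cond, label, sup, conf in ruleitems:
        key = frozenset(cond)
        if key not in best or conf > best[key][3]:
            best[key] = (cond, label, sup, conf)   # possible rule per condset
    return [r for r in best.values() if r[2] >= minsup and r[3] >= minconf]

items = [
    (("a",), "C1", 0.4, 0.8),    # possible rule for condset {a}
    (("a",), "C2", 0.1, 0.2),    # same condset, lower confidence -> dropped
    (("b",), "C1", 0.05, 0.9),   # possible but infrequent -> dropped
]
print(cars(items, minsup=0.1, minconf=0.5))  # keeps only ({a} -> C1)
```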
Further Considerations in CAR mining
Multiple minimum class supports deal with imbalanced class distributions, e.g., when one class is rare (98% negative, 2% positive): we can set minsup(positive) = 0.2% and minsup(negative) = 2%
If we are not interested in classifying the negative class, we may not want to generate rules for it; we can set minsup(negative) = 100% or more
Rule pruning may also be performed
Building Classifiers
There are many ways to build classifiers using CARs, and several systems exist
Simplest: after CARs are mined, do nothing further; for each test case, simply choose the most confident rule that covers it. Microsoft SQL Server has a similar method.
Or, using a combination of rules.
Another method (used in the CBA system) is similar to sequential covering.
Choose a set of rules to cover the training data.
Class Builder: Three Steps
The basic idea is to choose a set of high precedence rules in R to cover D.
Sort the set of generated rules R
Select rules for the classifier from R following the sorted sequence and put them in C
Each selected rule has to correctly classify at least one additional case; also select the default class and compute errors
Discard those rules in C that do not improve the accuracy of the classifier: locate the rule with the lowest error rate and discard the remaining rules in the sequence
Rules are sorted first
Definition: given two rules ri and rj, ri ≻ rj (ri precedes rj, or ri has higher precedence than rj) if
the confidence of ri is greater than that of rj, or
their confidences are the same, but the support of ri is greater than that of rj, or
both the confidences and supports are the same, but ri was generated earlier than rj
A CBA classifier L is of the form:
L = <r1, r2, …, rk, default-class>
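The three-level precedence above maps naturally onto a Python sort key (the dict representation of a rule is an assumption for illustration):

```python
# CBA rule precedence as a sort key: higher confidence first, then higher
# support, then earlier generation order (tuple comparison handles the tiers).

def precedence_key(rule):
    return (-rule["conf"], -rule["sup"], rule["order"])

rules = [
    {"name": "r1", "conf": 0.9,  "sup": 0.2, "order": 2},
    {"name": "r2", "conf": 0.9,  "sup": 0.3, "order": 1},
    {"name": "r3", "conf": 0.95, "sup": 0.1, "order": 3},
]
ranked = sorted(rules, key=precedence_key)
print([r["name"] for r in ranked])  # ['r3', 'r2', 'r1']
```

r3 wins on confidence; r2 beats r1 on support at equal confidence, matching the definition.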
Classifier building using CARs
Selection: each rule makes at least one correct prediction, and each case is covered by the rule with the highest precedence among those covering it
This algorithm is correct but not efficient
Classifier building using CARs
For each case d in D: coverRules_d = all rules covering d
Sort D according to the precedence of the first correctly predicting rule of each case d
RuleSet = empty
Scan D again to find the optimal rule set
Refined Classification Based on TopkRGS (RCBT)
General idea: construct the RCBT classifier from the top-k covering rule groups, so the number of rule groups generated is bounded
Efficiency and accuracy are validated by experimental results
Based on Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering Rule Groups for Gene Expression Data". SIGMOD'05
Dataset
In the microarray dataset, each row corresponds to a sample
Each item value corresponds to a discretized gene expression value
Class labels correspond to the category of the sample (cancer / not cancer)
Useful for diagnostic purposes
Introduction of gene expression data (Microarray)
Format of gene expression data: columns are genes (thousands of them); rows are samples with class labels (tens to hundreds of patients)
       gene1  gene2  gene3  gene4  Class
row1   10.6   1.9    33     22     C1
row2   18.6   5.6    56     10     C1
row3   5.7    4.3    133    22     C2

After discretization:

       items       Class
row1   a, b, e, h  C1
row2   c, d, e, f  C1
row3   a, b, g, h  C2
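The discretization step can be sketched as below. The binning scheme (equal-width into low/high) is an assumption for illustration; real microarray pipelines typically use entropy-based or equal-frequency discretization, and the item labels (a, b, c, …) in the slide's table come from such a mapping:

```python
# Hedged sketch of discretizing a gene expression matrix: bin each gene's
# continuous values into "low"/"high" around the midpoint of its range.

rows = [
    {"gene1": 10.6, "gene2": 1.9, "gene3": 33.0,  "gene4": 22.0},
    {"gene1": 18.6, "gene2": 5.6, "gene3": 56.0,  "gene4": 10.0},
    {"gene1": 5.7,  "gene2": 4.3, "gene3": 133.0, "gene4": 22.0},
]

def discretize(rows):
    genes = rows[0].keys()
    out = [{} for _ in rows]
    for g in genes:
        vals = [r[g] for r in rows]
        mid = (min(vals) + max(vals)) / 2   # equal-width, two bins
        for r, o in zip(rows, out):
            o[g] = g + ("_high" if r[g] > mid else "_low")
    return out

for items in discretize(rows):
    print(sorted(items.values()))
```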
Rule Example
Rule r: {a, e, h} → C
Support(r) = 3
Confidence(r) = 66%
RG: General solution
Step 1: Find all frequent itemsets in dataset D
Step 2: Generate rules of the form itemset → C; prune rules that do not have enough support and confidence
RG: Previous Algorithms
Item enumeration: search all the frequent itemsets by checking all possible combinations of items
{ }
{a}   {b}   {c}
{ab}  {ac}  {bc}
{abc}
We can simulate the search process in an item enumeration tree.
Microarray data
Features of microarray data: a few rows (100-1000)
A large number of items (~10,000)
The space of all combinations of items is huge: 2^10000
Motivations
Existing rule mining algorithms are very slow: the item search space is exponential in the number of items
→ use row enumeration to design a new algorithm
The number of association rules is huge, even for a given consequent
→ mine the top-k interesting rule groups for each row
Definitions
Row support set: given a set of items I′, R(I′) denotes the largest set of rows that contain I′
Item support set: given a set of rows R′, I(R′) denotes the largest set of items common to all rows in R′
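The two operators can be sketched over a small row database. The database below is hypothetical but chosen to be consistent with the example that follows (R({a,e,h}) = {r2, r3, r4} and I({r2, r3}) = {a, e, h}):

```python
# The two support-set operators above, over a toy row database.

DB = {
    "r1": {"c", "d", "e"},
    "r2": {"a", "e", "h"},
    "r3": {"a", "b", "e", "h"},
    "r4": {"a", "e", "g", "h"},
}

def R(items):   # rows containing every item in I'
    return {rid for rid, row in DB.items() if items <= row}

def I(rows):    # items common to every row in R'
    return set.intersection(*(DB[r] for r in rows))

print(sorted(R({"a", "e", "h"})))  # ['r2', 'r3', 'r4']
print(sorted(I({"r2", "r3"})))     # ['a', 'e', 'h']
```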
Example
I′ = {a, e, h}, then R(I′) = {r2, r3, r4}
R′ = {r2, r3}, then I(R′) = {a, e, h}
Rule Groups
What is a rule group? Given a one-row dataset {a, b, c, d, e, Cancer}, there are 31 rules of the form LHS → Cancer
All of them cover the same row with the same confidence (100%)
The group has 1 upper bound and 5 lower bounds
Rule group: a set of association rules whose LHS itemsets occur in the same set of rows. A rule group has a unique upper bound, here abcde → Cancer
Rule groups: example
Rule groups: a set of rules covered by the same rows
abc → C1 (100%)   (upper bound rule)
ab → C1 (100%)
a → C1 (100%)   b → C1 (100%)
ac → C1 (100%)  bc → C1 (100%)
c → C1 is not in the group

class  Items
C1     a, b, c
C1     a, b, c, d
C1     c, d, e
C2     c, d, e
Significance of rule groups
Rule group r1 is more significant than r2 if r1.conf > r2.conf, or
r1.conf = r2.conf and r1.sup > r2.sup
Finding top-k rule groups
Given dataset D, for each row of the dataset, find the k most significant covering rule groups (represented by their upper bounds), subject to the minimum support constraint
No minimum confidence is required
Top-k covering rule groups
For each row, we find the most significant k rule groups:
based on confidence first, then support
Given minsup = 1, top-1:
row 1: abc → C1 (sup = 2, conf = 100%)
row 2: abc → C1, abcd → C1 (sup = 1, conf = 100%)
row 3: cd → C1 (sup = 2, conf = 66.7%)
row 4: cde → C2 (sup = 1, conf = 50%)

class  Items
C1     a, b, c
C1     a, b, c, d
C1     c, d, e
C2     c, d, e
Relationship with CBA
The rules selected by CBA for classification are a subset of the rules of TopkRGS with k=1
So we can use top-1 covering rule groups to build the CBA classifier
The authors also proposed a refined classification method based on TopkRGS
Reduces the chance that test data is classified by a default class;
Uses a subset of rules to make a collective decision
Main advantages of Top-k coverage rule group
The number of rule groups is bounded by the product of k and the number of samples
Treats each sample equally, providing a (small) complete description for each row
Replaces the minimum-confidence parameter with k
Sufficient for building classifiers while avoiding excessive computation
Naïve method of finding TopkRGS
Find the complete set of upper bound rules in the dataset with a row-wise algorithm, then pick the top-k covering rule groups for each row
This is inefficient; to improve it, keep track of the top-k rule groups at each enumeration node dynamically and use effective pruning strategies
References
W. Cohen. Fast Effective Rule Induction. ICML'95.
X. Yin and J. Han. CPAR: Classification Based on Predictive Association Rules. SDM'03.
B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. KDD'98.
J. Li. On Optimal Rule Discovery. IEEE Transactions on Knowledge and Data Engineering, 18(4), 2006.
Some slides are offered by Zhang Xiang at http://www.cs.unc.edu/~xiang/