EECS 800 Research Seminar: Mining Biological Data
DESCRIPTION
EECS 800 Research Seminar: Mining Biological Data. Instructor: Luke Huan. Fall 2006. Overview: rule-based classification method overview; CBA: classification based on association; applying rule-based methods in microarray analysis; rule-based methods vs. SVM.

TRANSCRIPT
The UNIVERSITY of Kansas
EECS 800 Research Seminar: Mining Biological Data
Instructor: Luke Huan
Fall, 2006
Mining Biological Data, KU EECS 800, Luke Huan, Fall '06 (10/25/2006, Classification III)
Overview
Rule-based classification method overview
CBA: classification based on association
Applying rules based method in Microarray analysis
Rule-Based Methods vs. SVM
Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering Rule Groups for Gene Expression Data". SIGMOD'05, 2005.
Rule-Based Classifier
Classify records by using a collection of “if…then…” rules
Rule: (Condition) → y, where
Condition is a conjunction of attribute tests
y is the class label
Examples of classification rules:
(Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
(Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No
Rule-based Classifier (Example)
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds
Application of Rule-Based Classifier
A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
The rule R1 covers the hawk ⇒ Birds
The rule R3 covers the grizzly bear ⇒ Mammals
Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk          warm        no          yes      no             ?
grizzly bear  warm        yes         no       no             ?
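The covering test can be sketched in a few lines of Python (a hypothetical helper, not code from the slides): a rule covers an instance exactly when every attribute test in its condition is satisfied.

```python
# Minimal sketch: a rule "covers" an instance when every attribute test
# in its condition is satisfied by the instance's attribute values.

def covers(rule_condition, instance):
    """rule_condition: dict mapping attribute -> required value."""
    return all(instance.get(attr) == val for attr, val in rule_condition.items())

# R1: (Give Birth = no) AND (Can Fly = yes) -> Birds
r1 = {"Give Birth": "no", "Can Fly": "yes"}
hawk = {"Blood Type": "warm", "Give Birth": "no",
        "Can Fly": "yes", "Live in Water": "no"}

print(covers(r1, hawk))  # True -> the hawk is classified as a bird
```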
Rule Coverage and Accuracy
Coverage of a rule: fraction of records that satisfy the antecedent of the rule
Accuracy of a rule: fraction of covered records that satisfy both the antecedent and the consequent of the rule
(Status=Single) → No
Coverage = 40%, Accuracy = 50%
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
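The coverage and accuracy numbers above can be verified directly. This sketch (not from the slides) encodes the table's (Marital Status, Cheat) pairs and evaluates the rule (Status=Single) → No:

```python
# Compute coverage and accuracy of (Status=Single) -> Cheat=No
# over the 10 (Marital Status, Cheat) pairs from the table above.

records = [
    ("Single", "No"), ("Married", "No"), ("Single", "No"), ("Married", "No"),
    ("Divorced", "Yes"), ("Married", "No"), ("Divorced", "No"), ("Single", "Yes"),
    ("Married", "No"), ("Single", "Yes"),
]

covered = [r for r in records if r[0] == "Single"]   # antecedent holds
correct = [r for r in covered if r[1] == "No"]       # consequent also holds

coverage = len(covered) / len(records)   # 4/10 = 0.4
accuracy = len(correct) / len(covered)   # 2/4  = 0.5
print(coverage, accuracy)  # 0.4 0.5
```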
How does Rule-based Classifier Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
A lemur triggers rule R3, so it is classified as a mammal
A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
lemur          warm        yes         no       no             ?
turtle         cold        no          no       sometimes      ?
dogfish shark  cold        yes         no       yes            ?
Characteristics of Rule-Based Classifier
Mutually exclusive rules: every record is covered by at most one rule
Exhaustive rules: the classifier has exhaustive coverage if it accounts for every possible combination of attribute values
Each record is covered by at least one rule
From Decision Trees To Rules
[Decision tree: root Refund? Yes → No; No → Marital Status: {Married} → No; {Single, Divorced} → Taxable Income: < 80K → No, > 80K → Yes.]

Classification Rules
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single, Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single, Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
Rules are mutually exclusive and exhaustive
Rule set contains as much information as the tree
Rules Can Be Simplified
[Same decision tree as above: root Refund? Yes → No; No → Marital Status: {Married} → No; {Single, Divorced} → Taxable Income: < 80K → No, > 80K → Yes.]
Initial Rule: (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Effect of Rule Simplification
Rules are no longer mutually exclusive: a record may trigger more than one rule
Solution?
Ordered rule set
Unordered rule set – use voting schemes
Rules are no longer exhaustive: a record may not trigger any rule
Solution?
Use a default class
Ordered Rule Set
Rules are rank-ordered according to their priority; an ordered rule set is known as a decision list
When a test record is presented to the classifier, it is assigned the class label of the highest-ranked rule it triggers
If none of the rules fire, it is assigned to the default class
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name    Blood Type  Give Birth  Can Fly  Live in Water  Class
turtle  cold        no          no       sometimes      ?
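A decision list can be sketched as follows (assumptions: the rule order R1–R5 from the slide, and a made-up "Unknown" default class). The turtle triggers both R4 and R5, and the ordered scan resolves the conflict in favor of R4:

```python
# Sketch of an ordered rule set (decision list): return the class of the
# first rule that fires; otherwise fall back to a default class.

RULES = [
    ({"Give Birth": "no", "Can Fly": "yes"}, "Birds"),          # R1
    ({"Give Birth": "no", "Live in Water": "yes"}, "Fishes"),   # R2
    ({"Give Birth": "yes", "Blood Type": "warm"}, "Mammals"),   # R3
    ({"Give Birth": "no", "Can Fly": "no"}, "Reptiles"),        # R4
    ({"Live in Water": "sometimes"}, "Amphibians"),             # R5
]

def classify(instance, rules=RULES, default="Unknown"):
    for condition, label in rules:
        if all(instance.get(a) == v for a, v in condition.items()):
            return label
    return default  # no rule fired -> default class

turtle = {"Blood Type": "cold", "Give Birth": "no", "Can Fly": "no",
          "Live in Water": "sometimes"}
print(classify(turtle))  # Reptiles: R4 fires before R5 in the ordered list
```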
Rule Ordering Schemes
Rule-based ordering: individual rules are ranked based on their quality
Class-based ordering: rules that belong to the same class appear together
Rule-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single, Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single, Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
Class-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single, Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Married}) ==> No
(Refund=No, Marital Status={Single, Divorced}, Taxable Income>80K) ==> Yes
Building Classification Rules
Direct Method: Extract rules directly from data
e.g.: CBA
Indirect Method: Extract rules from other classification models (e.g., decision trees, neural networks)
e.g., C4.5rules
Direct Method: Sequential Covering
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Steps (2) and (3) until the stopping criterion is met
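The four steps above can be sketched as a loop. This is a hedged illustration, not RIPPER itself: the `learn_one_rule` here is a deliberately simple stand-in that greedily picks the single best attribute-value test, whereas real systems grow multi-conjunct rules.

```python
# Skeleton of sequential covering. Assumption: learn_one_rule greedily
# picks the single attribute=value test with the highest accuracy on the
# remaining data (real learners grow rules conjunct by conjunct).

def learn_one_rule(data, target):
    best = None
    for attr in data[0][0]:
        for val in {x[0][attr] for x in data}:
            cov = [x for x in data if x[0][attr] == val]
            acc = sum(1 for x in cov if x[1] == target) / len(cov)
            if best is None or acc > best[1]:
                best = ({attr: val}, acc)
    return best[0]

def sequential_covering(data, target, max_rules=5):
    rules, remaining = [], list(data)
    # Step 1: start from an empty rule set
    while any(y == target for _, y in remaining) and len(rules) < max_rules:
        cond = learn_one_rule(remaining, target)   # Step 2: grow a rule
        rules.append((cond, target))
        # Step 3: remove training records covered by the new rule
        remaining = [x for x in remaining
                     if not all(x[0][a] == v for a, v in cond.items())]
    return rules                                    # Step 4: loop until done

data = [({"Fly": "yes"}, "Bird"), ({"Fly": "no"}, "Mammal"),
        ({"Fly": "yes"}, "Bird")]
print(sequential_covering(data, "Bird"))  # [({'Fly': 'yes'}, 'Bird')]
```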
Example of Sequential Covering
(i) Original Data (ii) Step 1
Example of Sequential Covering…
(iii) Step 2: rule R1 learned  (iv) Step 3: rules R1 and R2 learned
Aspects of Sequential Covering
Rule Growing
Rule Evaluation
Stopping Criterion
Rule Pruning
Rule Growing
Two common strategies
(a) General-to-specific: start from the empty rule { } (Yes: 3, No: 4) and grow it by adding candidate conjuncts such as Refund=No, Status=Single, Status=Divorced, Status=Married, or Income > 80K, keeping the conjunct with the best class distribution
(b) Specific-to-general: start from specific rules such as (Refund=No, Status=Single, Income=85K) → Class=Yes and (Refund=No, Status=Single, Income=90K) → Class=Yes, and generalize them to (Refund=No, Status=Single) → Class=Yes
Rule Growing (Examples)
RIPPER algorithm: start from an empty rule {} => class, and add conjuncts that maximize FOIL's information gain measure:
R0: {} => class (initial rule)
R1: {A} => class (rule after adding a conjunct)
Gain(R0, R1) = t × [ log2(p1 / (p1 + n1)) − log2(p0 / (p0 + n0)) ]
where
t: number of positive instances covered by both R0 and R1
p0: number of positive instances covered by R0
n0: number of negative instances covered by R0
p1: number of positive instances covered by R1
n1: number of negative instances covered by R1
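The gain formula transcribes directly into code. The counts in the example call are made up for illustration:

```python
# FOIL's information gain for a candidate refinement R0 -> R1,
# transcribed from the formula above (base-2 logarithms).
import math

def foil_gain(p0, n0, p1, n1, t):
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# Hypothetical counts: R0 covers 100 positives / 400 negatives; after
# adding a conjunct, R1 covers 30 positives / 10 negatives, and all 30
# positives of R1 are also covered by R0 (t = 30).
print(foil_gain(p0=100, n0=400, p1=30, n1=10, t=30))  # ≈ 57.2
```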
Rule Evaluation
Metrics (for a rule covering n instances, nc of which belong to the class the rule predicts; k classes; p the prior probability of the class):
Accuracy = nc / n
Laplace = (nc + 1) / (n + k)
M-estimate = (nc + k·p) / (n + k)
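The three metrics are one-liners; here they are evaluated for a toy rule (the counts are illustrative, not from the slides):

```python
# The three rule-evaluation metrics above, for a rule covering n instances,
# nc of which belong to the predicted class, with k classes and prior p.

def accuracy(nc, n):
    return nc / n

def laplace(nc, n, k):
    return (nc + 1) / (n + k)

def m_estimate(nc, n, k, p):
    return (nc + k * p) / (n + k)

# A rule covering 3 instances, all correct, in a 2-class problem:
print(accuracy(3, 3))                 # 1.0
print(laplace(3, 3, k=2))             # 4/5 = 0.8
print(m_estimate(3, 3, k=2, p=0.5))   # 4/5 = 0.8
```

Note how Laplace and the m-estimate pull the optimistic accuracy of 1.0 toward the prior, penalizing rules with tiny coverage.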
Stopping Criterion and Rule Pruning
Stopping criterion: compute the gain; if the gain is not significant, discard the new rule
Rule pruning: similar to post-pruning of decision trees. Reduced error pruning:
Remove one of the conjuncts in the rule, compare the error rate on a validation set before and after pruning, and prune the conjunct if the error improves
Summary of Direct Method
Grow a single rule
Remove instances covered by the rule
Prune the rule (if necessary)
Add rule to Current Rule Set
Repeat
Direct Method: RIPPER
For a 2-class problem, choose one of the classes as the positive class and the other as the negative class
Learn rules for the positive class
The negative class becomes the default class
For a multi-class problem, order the classes by increasing class prevalence (fraction of instances that belong to a particular class)
Learn the rule set for the smallest class first, treating the rest as the negative class
Repeat with the next smallest class as the positive class
Indirect Methods
Rule Set
r1: (P=No, Q=No) ==> -
r2: (P=No, Q=Yes) ==> +
r3: (P=Yes, R=No) ==> +
r4: (P=Yes, R=Yes, Q=No) ==> -
r5: (P=Yes, R=Yes, Q=Yes) ==> +
[Decision tree over attributes P, Q, and R whose leaf labels (+/-) correspond to the rule set above.]
Indirect Method: C4.5rules
Extract rules from an unpruned decision tree. For each rule r: A → y,
consider an alternative rule r′: A′ → y, where A′ is obtained by removing one of the conjuncts in A. Compare the pessimistic error rate of r against all the r′. Prune r if one of the r′ has a lower pessimistic error rate. Repeat until the generalization error can no longer be improved.
Advantages of Rule-Based Classifiers
As highly expressive as decision trees
Easy to interpret
Easy to generate
Can classify new instances rapidly
Performance comparable to decision trees
Overview of CBA
Classification rule mining versus association rule mining:
Aim: a small set of rules as a classifier, versus all rules satisfying minsup and minconf
Syntax: X → y (fixed class consequent), versus X → Y
Association Rules for Classification
Classification: mine a small set of rules existing in the data to form a classifier or predictor.
It has a target attribute: the class attribute
Association rules have no fixed target, but we can fix one. Class association rules (CARs) have a target class attribute. E.g.,
Own_house = true → Class = Yes [sup = 6/15, conf = 6/6]
CARs can obviously be used for classification.
B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. KDD'98. http://www.comp.nus.edu.sg/~dm2/
Decision tree vs. CARs
The decision tree below generates the following 3 rules:
Own_house = true → Class = Yes [sup = 6/15, conf = 6/6]
Own_house = false, Has_job = true → Class = Yes [sup = 5/15, conf = 5/5]
Own_house = false, Has_job = false → Class = No [sup = 4/15, conf = 4/4]
But there are many other rules that are not found by the decision tree
There are many more rules
CAR mining finds all of them. In many cases, rules not in the decision tree (or a rule list) may perform classification better. Such rules may also be actionable in practice
Decision tree vs. CARs (cont…)
Association mining requires discrete attributes, while decision tree learning uses both discrete and continuous attributes
CAR mining requires continuous attributes to be discretized; several such algorithms exist
A decision tree is not constrained by minsup or minconf, and thus can find rules with very low support; of course, such rules may be pruned to avoid overfitting
CBA: Three Steps
Discretize continuous attributes, if any
Generate all class association rules (CARs)
Build a classifier from the generated CARs
RG: The Algorithm
Find the complete set of all possible rules
This usually takes a long time to finish
RG: Basic Concepts
Frequent ruleitem: a ruleitem is frequent if its support is above minsup
Accurate rule: a rule is accurate if its confidence is above minconf
Possible rule: among all ruleitems that share the same condset, the ruleitem with the highest confidence is the possible rule of that set of ruleitems
The set of class association rules (CARs) consists of all possible rules (PRs) that are both frequent and accurate
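The CAR filter above can be sketched as follows. The ruleitem representation and the example numbers are assumptions for illustration, not the CBA implementation:

```python
# Sketch of the CAR selection above: for each condset keep only its
# highest-confidence ruleitem (the "possible rule"), then keep possible
# rules that are both frequent (minsup) and accurate (minconf).

def cars(ruleitems, minsup, minconf):
    """ruleitems: list of (condset, class_label, support, confidence)."""
    best = {}
    for cond, label, sup, conf in ruleitems:
        key = frozenset(cond)
        if key not in best or conf > best[key][3]:
            best[key] = (cond, label, sup, conf)   # possible rule per condset
    return [r for r in best.values() if r[2] >= minsup and r[3] >= minconf]

items = [
    (("a",), "C1", 0.4, 0.8),    # possible rule for condset {a}
    (("a",), "C2", 0.1, 0.2),    # same condset, lower confidence -> dropped
    (("b",), "C1", 0.05, 0.9),   # possible but infrequent -> dropped
]
print(cars(items, minsup=0.1, minconf=0.5))  # keeps only ({a} -> C1)
```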
Further Considerations in CAR mining
Multiple minimum class supports deal with imbalanced class distributions, e.g., when one class is rare (98% negative, 2% positive): we can set minsup(positive) = 0.2% and minsup(negative) = 2%
If we are not interested in classifying the negative class, we may not want to generate rules for it; we can set minsup(negative) = 100% or more
Rule pruning may also be performed
Building Classifiers
There are many ways to build classifiers using CARs, and several systems exist
Simplest: after CARs are mined, do nothing further; for each test case, simply choose the most confident rule that covers it. Microsoft SQL Server has a similar method.
Or, using a combination of rules.
Another method (used in the CBA system) is similar to sequential covering.
Choose a set of rules to cover the training data.
Class Builder: Three Steps
The basic idea is to choose a set of high precedence rules in R to cover D.
Sort the set of generated rules R
Select rules for the classifier from R following the sorted sequence and put them in C
Each selected rule has to correctly classify at least one additional case; also select the default class and compute errors
Discard those rules in C that do not improve the accuracy of the classifier: locate the rule with the lowest error rate and discard the remaining rules in the sequence
Rules are sorted first
Definition: given two rules ri and rj, ri ≻ rj (ri precedes rj, or ri has higher precedence than rj) if
the confidence of ri is greater than that of rj, or
their confidences are the same, but the support of ri is greater than that of rj, or
both the confidences and supports are the same, but ri was generated earlier than rj
A CBA classifier L is of the form:
L = <r1, r2, …, rk, default-class>
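The three-level precedence above maps naturally onto a Python sort key (the dict representation of a rule is an assumption for illustration):

```python
# CBA rule precedence as a sort key: higher confidence first, then higher
# support, then earlier generation order (tuple comparison handles the tiers).

def precedence_key(rule):
    return (-rule["conf"], -rule["sup"], rule["order"])

rules = [
    {"name": "r1", "conf": 0.9,  "sup": 0.2, "order": 2},
    {"name": "r2", "conf": 0.9,  "sup": 0.3, "order": 1},
    {"name": "r3", "conf": 0.95, "sup": 0.1, "order": 3},
]
ranked = sorted(rules, key=precedence_key)
print([r["name"] for r in ranked])  # ['r3', 'r2', 'r1']
```

r3 wins on confidence; r2 beats r1 on support at equal confidence, matching the definition.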
Classifier building using CARs
Selection: each rule makes at least one correct prediction, and each case is covered by the rule with the highest precedence among those covering it
This algorithm is correct but not efficient
Classifier building using CARs
For each case d in D: coverRules_d = all rules covering d
Sort D according to the precedence of the first correctly predicting rule of each case d
RuleSet = empty
Scan D again to find the optimal rule set
Refined Classification Based on TopkRGS (RCBT)
General idea: construct the RCBT classifier from the top-k covering rule groups, so the number of rule groups generated is bounded
Efficiency and accuracy are validated by experimental results
Based on Gao Cong, Kian-Lee Tan, Anthony K. H. Tung, Xin Xu. "Mining Top-k Covering Rule Groups for Gene Expression Data". SIGMOD'05
Dataset
In the microarray dataset, each row corresponds to a sample
Each item value corresponds to a discretized gene expression value
Class labels correspond to the category of the sample (cancer / not cancer)
Useful for diagnostic purposes
Introduction of gene expression data (Microarray)
Format of gene expression data: columns are genes (thousands of them); rows are samples with class labels (tens to hundreds of patients)
       gene1  gene2  gene3  gene4  Class
row1   10.6   1.9    33     22     C1
row2   18.6   5.6    56     10     C1
row3   5.7    4.3    133    22     C2

After discretization:

       items       Class
row1   a, b, e, h  C1
row2   c, d, e, f  C1
row3   a, b, g, h  C2
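The discretization step can be sketched as below. The binning scheme (equal-width into low/high) is an assumption for illustration; real microarray pipelines typically use entropy-based or equal-frequency discretization, and the item labels (a, b, c, …) in the slide's table come from such a mapping:

```python
# Hedged sketch of discretizing a gene expression matrix: bin each gene's
# continuous values into "low"/"high" around the midpoint of its range.

rows = [
    {"gene1": 10.6, "gene2": 1.9, "gene3": 33.0,  "gene4": 22.0},
    {"gene1": 18.6, "gene2": 5.6, "gene3": 56.0,  "gene4": 10.0},
    {"gene1": 5.7,  "gene2": 4.3, "gene3": 133.0, "gene4": 22.0},
]

def discretize(rows):
    genes = rows[0].keys()
    out = [{} for _ in rows]
    for g in genes:
        vals = [r[g] for r in rows]
        mid = (min(vals) + max(vals)) / 2   # equal-width, two bins
        for r, o in zip(rows, out):
            o[g] = g + ("_high" if r[g] > mid else "_low")
    return out

for items in discretize(rows):
    print(sorted(items.values()))
```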
Rule Example
Rule r: {a, e, h} → C
Support(r) = 3
Confidence(r) = 66%
RG: General solution
Step 1: Find all frequent itemsets in dataset D
Step 2: Generate rules of the form itemset → C; prune rules that do not have enough support and confidence
RG: Previous Algorithms
Item enumeration: search all the frequent itemsets by checking all possible combinations of items
{ }
{a}   {b}   {c}
{ab}  {ac}  {bc}
{abc}
We can simulate the search process in an item enumeration tree.
Microarray data
Features of microarray data: a few rows (100-1000)
A large number of items (~10,000)
The space of all combinations of items is huge: 2^10000
Motivations
Existing rule mining algorithms are very slow: the item search space is exponential in the number of items
→ use row enumeration to design a new algorithm
The number of association rules is huge, even for a given consequent
→ mine the top-k interesting rule groups for each row
Definitions
Row support set: given a set of items I′, R(I′) denotes the largest set of rows that contain I′
Item support set: given a set of rows R′, I(R′) denotes the largest set of items common to all rows in R′
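The two operators can be sketched over a small row database. The database below is hypothetical but chosen to be consistent with the example that follows (R({a,e,h}) = {r2, r3, r4} and I({r2, r3}) = {a, e, h}):

```python
# The two support-set operators above, over a toy row database.

DB = {
    "r1": {"c", "d", "e"},
    "r2": {"a", "e", "h"},
    "r3": {"a", "b", "e", "h"},
    "r4": {"a", "e", "g", "h"},
}

def R(items):   # rows containing every item in I'
    return {rid for rid, row in DB.items() if items <= row}

def I(rows):    # items common to every row in R'
    return set.intersection(*(DB[r] for r in rows))

print(sorted(R({"a", "e", "h"})))  # ['r2', 'r3', 'r4']
print(sorted(I({"r2", "r3"})))     # ['a', 'e', 'h']
```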
Example
I′ = {a, e, h}, then R(I′) = {r2, r3, r4}
R′ = {r2, r3}, then I(R′) = {a, e, h}
Rule Groups
What is a rule group? Given a one-row dataset {a, b, c, d, e, Cancer}, there are 31 rules of the form LHS → Cancer
All of them cover the same row with the same confidence (100%)
The group has 1 upper bound and 5 lower bounds
Rule group: a set of association rules whose LHS itemsets occur in the same set of rows. A rule group has a unique upper bound, here abcde → Cancer
Rule groups: example
Rule groups: a set of rules covered by the same rows
abc → C1 (100%)   (upper bound rule)
ab → C1 (100%)
a → C1 (100%)   b → C1 (100%)
ac → C1 (100%)  bc → C1 (100%)
c → C1 is not in the group

class  Items
C1     a, b, c
C1     a, b, c, d
C1     c, d, e
C2     c, d, e
Significance of rule groups
Rule group r1 is more significant than r2 if r1.conf > r2.conf, or
r1.conf = r2.conf and r1.sup > r2.sup
Finding top-k rule groups
Given dataset D, for each row of the dataset, find the k most significant covering rule groups (represented by their upper bounds), subject to the minimum support constraint
No minimum confidence is required
Top-k covering rule groups
For each row, we find the most significant k rule groups:
based on confidence first, then support
Given minsup = 1, top-1:
row 1: abc → C1 (sup = 2, conf = 100%)
row 2: abc → C1, abcd → C1 (sup = 1, conf = 100%)
row 3: cd → C1 (sup = 2, conf = 66.7%)
row 4: cde → C2 (sup = 1, conf = 50%)

class  Items
C1     a, b, c
C1     a, b, c, d
C1     c, d, e
C2     c, d, e
Relationship with CBA
The rules selected by CBA for classification are a subset of the rules of TopkRGS with k=1
So we can use top-1 covering rule groups to build the CBA classifier
The authors also proposed a refined classification method based on TopkRGS
Reduces the chance that test data is classified by a default class;
Uses a subset of rules to make a collective decision
Main advantages of Top-k coverage rule group
The number of rule groups is bounded by the product of k and the number of samples
Treats each sample equally, providing a (small) complete description for each row
Replaces the minimum-confidence parameter with k
Sufficient for building classifiers while avoiding excessive computation
Naïve method of finding TopkRGS
Find the complete set of upper bound rules in the dataset with a row-wise algorithm, then pick the top-k covering rule groups for each row
This is inefficient; to improve it, keep track of the top-k rule groups at each enumeration node dynamically and use effective pruning strategies
References
W. Cohen. Fast Effective Rule Induction. ICML'95.
X. Yin and J. Han. CPAR: Classification Based on Predictive Association Rules. SDM'03.
B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. KDD'98.
J. Li. On Optimal Rule Discovery. IEEE Transactions on Knowledge and Data Engineering, 18(4), 2006.
Some slides are offered by Zhang Xiang at http://www.cs.unc.edu/~xiang/