
Page 1: Data Mining – Output: Knowledge Representation

Data Mining – Output: Knowledge Representation

Chapter 3

Page 2: Data Mining – Output: Knowledge Representation

Representing Structural Patterns

• There are many different ways of representing patterns

• Two were covered in Chapter 1 – decision trees and classification rules

• A learned pattern is a form of “knowledge representation” (even if the knowledge does not seem very impressive)

Page 3: Data Mining – Output: Knowledge Representation

Decision Trees

• Make decisions by following branches down the tree until a leaf is found

• Classification is based on the contents of the leaf

• A non-leaf node usually involves testing a single attribute

– Usually one branch per value of a nominal attribute, or branches for ranges of a numeric attribute (most commonly a two-way split: > some value and ≤ the same value)

– Less commonly, a comparison of two attribute values, or some function of multiple attributes

• It is common for an attribute, once used, not to be tested again at a lower level of the same branch (a classification sketch follows)
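
To make the branch-following concrete, here is a minimal sketch in Python. It is not from the book or from Weka; the dictionary-based node structure, the attribute names, and the weather-style tree are assumptions for illustration.

def classify(node, instance):
    # Follow branches until a leaf (a node carrying a "class") is reached
    while "class" not in node:
        value = instance[node["test_attribute"]]   # test a single attribute
        node = node["branches"][value]             # follow the matching branch
    return node["class"]

tree = {"test_attribute": "outlook",
        "branches": {"sunny":    {"test_attribute": "humidity",
                                  "branches": {"high":   {"class": "no"},
                                               "normal": {"class": "yes"}}},
                     "overcast": {"class": "yes"},
                     "rainy":    {"class": "no"}}}

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))   # -> yes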

Page 4: Data Mining – Output: Knowledge Representation

Decision Trees

• Missing values

– May be treated as another possible value of a nominal attribute – useful if a missing value may itself mean something

– May simply follow the most popular branch when the value is missing from a test instance

– A more complicated approach: rather than going all-or-nothing, ‘split’ the test instance across the branches in proportion to the popularity of each branch (as observed in the training data); recombination at the end uses a vote based on the weights (see the sketch below)
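
A minimal sketch of the fractional-instance idea, assuming the same dictionary-based tree as in the earlier sketch plus a "branch_counts" entry recording how many training instances went down each branch; this is an illustration only, not Weka's implementation.

from collections import Counter

def classify_weighted(node, instance, weight=1.0, votes=None):
    votes = Counter() if votes is None else votes
    if "class" in node:                              # leaf: contribute a weighted vote
        votes[node["class"]] += weight
        return votes
    attr = node["test_attribute"]
    if attr in instance:                             # value present: follow one branch
        classify_weighted(node["branches"][instance[attr]], instance, weight, votes)
    else:                                            # value missing: split the instance
        total = sum(node["branch_counts"].values())
        for value, count in node["branch_counts"].items():
            classify_weighted(node["branches"][value], instance,
                              weight * count / total, votes)
    return votes                                     # the highest-weight class wins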

Page 5: Data Mining – Output: Knowledge Representation

Classification Rules

• Popular alternative to decision trees

• LHS / antecedent / precondition – tests that determine whether the rule is applicable

– Tests are usually ANDed together

– Could be a general logical condition (AND/OR/NOT), but learning such rules is MUCH less constrained

• RHS / consequent / conclusion – the answer, usually the class (but could be a probability distribution)

• Rules with the same conclusion essentially represent an OR

• Rules may be an ordered set, or independent

• If independent, a policy may need to be established for when more than one rule matches (a conflict resolution strategy) or when no rule matches (see the sketch below)
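
A minimal sketch of applying an ordered rule set with a default class when no rule fires; the rule representation, the attribute names, and the default are assumptions for illustration, not the output of any particular learner.

rules = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": "TRUE"},    "no"),
]
default_class = "yes"

def apply_rules(instance):
    for antecedent, conclusion in rules:          # ordered: the first matching rule wins
        if all(instance.get(a) == v for a, v in antecedent.items()):
            return conclusion
    return default_class                          # no rule matched: fall back to default

print(apply_rules({"outlook": "sunny", "humidity": "high", "windy": "FALSE"}))   # -> no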

Page 6: Data Mining – Output: Knowledge Representation

Rules / Trees

• Rules can easily be created from a tree – but this does not give the simplest possible set of rules

• Transforming rules into a tree is not straightforward (see “replicated subtree” problem – next two slides)

• In many cases rules are more compact than trees – particularly if a default rule is possible

• Rules may appear to be independent nuggets of knowledge (and hence less complicated than trees) – but if rules are an ordered set, then they are much more complicated than they appear

Page 7: Data Mining – Output: Knowledge Representation

Figure 3.1 Decision tree for a simple disjunction.

If a and b then x
If c and d then x

Page 8: Data Mining – Output: Knowledge Representation

Figure 3.3 Decision tree with a replicated subtree.

If x=1 and y=1 then class = a
If z=1 and w=1 then class = a
Otherwise class = b

Each gray triangle actually contains the whole gray subtree below

Page 9: Data Mining – Output: Knowledge Representation

Association Rules

• Association rules are not intended to be used together as a set – their value lies in the knowledge they reveal – there is probably no automatic use of the rules

• There are large numbers of possible rules

Page 10: Data Mining – Output: Knowledge Representation

Association Rule Evaluation

• Coverage – the number of instances for which the rule predicts correctly – also called support

• Accuracy – the proportion of instances to which the rule applies for which it predicts correctly – also called confidence

• Coverage is sometimes expressed as a percentage of the total number of instances

• Usually the method or the user specifies a minimum coverage and accuracy for the rules to be generated

• Some possible rules imply others – present only the strongest supported rule (see the sketch below)
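
To pin down the two measures, here is a minimal sketch computing coverage (support) and accuracy (confidence) for the rule "outlook=rainy ==> play=no" over a toy dataset; the instances are made up for illustration and are not the book's weather data.

data = [
    {"outlook": "rainy", "play": "no"},
    {"outlook": "rainy", "play": "no"},
    {"outlook": "sunny", "play": "yes"},
    {"outlook": "rainy", "play": "yes"},
]
applies = [d for d in data if d["outlook"] == "rainy"]    # instances the rule applies to
correct = [d for d in applies if d["play"] == "no"]       # ... that it also predicts correctly

coverage   = len(correct)                   # support: number predicted correctly
confidence = len(correct) / len(applies)    # accuracy: correct / applicable
print(coverage, confidence)                 # -> 2 0.666...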

Page 11: Data Mining – Output: Knowledge Representation

Example – My Weather – Apriori Algorithm

Apriori
Minimum support: 0.15
Minimum metric <confidence>: 0.9
Number of cycles performed: 17

Best rules found:

 1. outlook=rainy 5 ==> play=no 5    conf:(1)
 2. temperature=cool 4 ==> humidity=normal 4    conf:(1)
 3. temperature=hot windy=FALSE 3 ==> play=no 3    conf:(1)
 4. temperature=hot play=no 3 ==> windy=FALSE 3    conf:(1)
 5. outlook=rainy windy=FALSE 3 ==> play=no 3    conf:(1)
 6. outlook=rainy humidity=normal 3 ==> play=no 3    conf:(1)
 7. outlook=rainy temperature=mild 3 ==> play=no 3    conf:(1)
 8. temperature=mild play=no 3 ==> outlook=rainy 3    conf:(1)
 9. temperature=hot humidity=high windy=FALSE 2 ==> play=no 2    conf:(1)
10. temperature=hot humidity=high play=no 2 ==> windy=FALSE 2    conf:(1)

Page 12: Data Mining – Output: Knowledge Representation

Rules with Exceptions

• Skip

Page 13: Data Mining – Output: Knowledge Representation

Rules involving Relations

• More than the values of individual attributes may be important – relations between attributes can matter

• See book example on next slide

Page 14: Data Mining – Output: Knowledge Representation

Figure 3.6 The shapes problem.

Shaded: standing

Unshaded: lying

Page 15: Data Mining – Output: Knowledge Representation

More Complicated – Winston’s Blocks World

• House – a 3-sided block and a 4-sided block, with the 3-sided block on top of the 4-sided one

• Solutions frequently involve learning rules that include variables/parameters

– E.g. 3sided(block1) & 4sided(block2) & ontopof(block1,block2) → house (see the sketch below)
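
A minimal sketch of checking such a rule with variables; the Python representation of blocks and of the ontopof relation is an assumption made purely to illustrate the rule's meaning, not how such rules are learned.

def is_house(b1, b2, ontopof):
    # 3sided(block1) & 4sided(block2) & ontopof(block1, block2) -> house
    return b1["sides"] == 3 and b2["sides"] == 4 and (b1["name"], b2["name"]) in ontopof

roof = {"name": "roof", "sides": 3}
base = {"name": "base", "sides": 4}
print(is_house(roof, base, {("roof", "base")}))   # -> True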

Page 16: Data Mining – Output: Knowledge Representation

Easier and Sometimes Useful

• Introduce new attributes during data preparation

• The new attribute represents a relationship

– E.g. for the standing/lying task, a new boolean attribute widthgreater? could be introduced and filled in for each instance during data preparation (see the sketch below)

– E.g. in numeric weather, a “WindChill” attribute could be introduced, calculated from temperature and wind speed (if numeric), or a “Heat Index” based on temperature and humidity
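
A minimal sketch of filling in such a derived attribute during data preparation; the field names and values are assumptions for illustration.

instances = [
    {"width": 2, "height": 4},   # standing
    {"width": 5, "height": 1},   # lying
]
for inst in instances:
    inst["widthgreater"] = inst["width"] > inst["height"]   # new relational attribute

print(instances)   # each instance now carries the boolean widthgreater attribute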

Page 17: Data Mining – Output: Knowledge Representation

Numeric Prediction

• The standard of comparison for numeric prediction is the statistical technique of linear regression

• E.g. for the CPU performance data the regression equation below was derived

PRP = - 56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX
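
As a quick check of how such an equation is used, here is a minimal sketch that plugs one machine's attribute values into the equation above; the attribute values themselves are made up for illustration.

def predict_prp(MYCT, MMIN, MMAX, CACH, CHMIN, CHMAX):
    # the regression equation for the CPU performance data, as given above
    return (-56.1 + 0.049 * MYCT + 0.015 * MMIN + 0.006 * MMAX
            + 0.630 * CACH - 0.270 * CHMIN + 1.46 * CHMAX)

print(predict_prp(MYCT=480, MMIN=512, MMAX=8000, CACH=32, CHMIN=0, CHMAX=0))   # -> about 43.3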

Page 18: Data Mining – Output: Knowledge Representation

Trees for Numeric Prediction

• The tree branches as in a decision tree (branches may be based on ranges of attribute values)

• Regression tree – each leaf node contains the average of the training-set values to which the leaf applies

• Model tree – each leaf node contains a regression equation for the instances to which the leaf applies
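
A minimal sketch contrasting the two kinds of leaf; the leaf structure and the numbers are assumptions for illustration, not the trees shown in Figure 3.7.

def leaf_prediction(leaf, instance):
    if leaf["kind"] == "regression":            # regression tree: a constant (the mean)
        return leaf["mean"]
    intercept, weights = leaf["model"]          # model tree: a local linear equation
    return intercept + sum(w * instance[a] for a, w in weights.items())

reg_leaf   = {"kind": "regression", "mean": 19.3}
model_leaf = {"kind": "model", "model": (10.0, {"CACH": 0.5, "CHMAX": 1.2})}
print(leaf_prediction(reg_leaf, {}))                            # -> 19.3
print(leaf_prediction(model_leaf, {"CACH": 32, "CHMAX": 8}))    # -> 35.6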

Page 19: Data Mining – Output: Knowledge Representation

Figure 3.7(b) Models for the CPU performance data: regression tree.

Page 20: Data Mining – Output: Knowledge Representation

Figure 3.7(c) Models for the CPU performance data: model tree.

Page 21: Data Mining – Output: Knowledge Representation

Instance Based Representation

• The concept is not really represented (except via the examples)

• Real-world example – some radio stations don’t define what they play in words; they play promos basically saying “WXXX music is:” <songs>

• Training examples are merely stored (kind of like “rote learning”)

• Answers are given by finding the training example(s) most similar to the test instance at testing time

• Has been called “lazy learning” – no work is done until an answer is needed

Page 22: Data Mining – Output: Knowledge Representation

Instance Based – Finding Most Similar Example

• Nearest neighbor – each new instance is compared to all stored instances, with a “distance” calculated from each attribute for each instance

• The class of the nearest-neighbor instance is used as the prediction <see next slide and come back>

• OR the k nearest neighbors vote, or cast a weighted vote

• Per-attribute distances are combined – city block (Manhattan) or Euclidean (as the crow flies) – see the sketch below
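
A minimal sketch of 1-nearest-neighbor with Euclidean distance on two numeric attributes; the training points and their classes are made up for illustration.

import math

training = [([1.0, 2.0], "x"), ([5.0, 5.0], "y"), ([9.0, 1.0], "z")]

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nearest_neighbor(test):
    # pick the class of the stored instance with the smallest distance
    return min(training, key=lambda ex: euclidean(ex[0], test))[1]

print(nearest_neighbor([4.0, 4.5]))   # -> "y"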

Page 23: Data Mining – Output: Knowledge Representation

Nearest Neighbor

[Diagram: a test instance T plotted among stored training instances labelled x, y, and z.]

Page 24: Data Mining – Output: Knowledge Representation

Additional Details

• The distance/similarity function must deal with binary/nominal attributes – usually via an all-or-nothing match – but “mild” should be a better match to “hot” than “cool” is!

• The distance/similarity function is simpler if the data is normalized in advance – e.g. a $10 difference in household income is not significant, while a distance of 1.0 in GPA is big (see the sketch below)

• The distance/similarity function should weight different attributes differently – a key task is determining those weights
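
A minimal sketch of min–max normalization so that each numeric attribute lies in [0, 1] before distances are combined; the income and GPA values are made up for illustration.

incomes = [32000.0, 45000.0, 45010.0, 98000.0]
gpas    = [2.1, 3.0, 3.1, 3.9]

def normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(normalize(incomes))   # a $10 gap becomes a tiny fraction of the range
print(normalize(gpas))      # a 1.0 GPA gap remains a large fraction of the range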

Page 25: Data Mining – Output: Knowledge Representation

Further Wrinkles

• It may not be necessary to save all instances

– Very normal instances may not all need to be saved

– Some approaches actually do some generalization

Page 26: Data Mining – Output: Knowledge Representation

But …

• There is not really a structural pattern that can be pointed to

• However, many people in many tasks/domains will respect arguments based on “previous cases” (diagnosis and law among them)

• The book points out that the instances plus the distance metric combine to form class boundaries

– With 2 attributes, these boundaries can actually be visualized <see next slide>

Page 27: Data Mining – Output: Knowledge Representation

Figure 3.8 Different ways of partitioning the instance space.

(a) (b) (c) (d)

Page 28: Data Mining – Output: Knowledge Representation

Clustering

• Clusters may be representable graphically

• If dimensionality is high, the best representation may only be tabular – showing which instances are in which clusters

• Show Weka – run njcrimenominal with EM and then visualize the results

• Some algorithms associate instances with clusters probabilistically – for every instance, listing the probability of membership in each of the clusters (see the sketch below)

• Some algorithms produce a hierarchy of clusters, which can be visualized using a tree diagram

• After clustering, the clusters may be used as the class for classification
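
A minimal sketch of probabilistic cluster membership and of turning it into a hard label (e.g. to use the cluster as a class afterwards); the membership values are taken from the table in Figure 3.9, and the “clusterN” labels are assumptions for illustration.

memberships = {
    "a": [0.4, 0.1, 0.5],
    "b": [0.1, 0.8, 0.1],
    "c": [0.3, 0.3, 0.4],
}
labels = {inst: f"cluster{probs.index(max(probs)) + 1}"   # most probable cluster
          for inst, probs in memberships.items()}
print(labels)   # -> {'a': 'cluster3', 'b': 'cluster2', 'c': 'cluster3'}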

Page 29: Data Mining – Output: Knowledge Representation

Figure 3.9 Different ways of representing clusters: (a) (b) (c) (d).

[Cluster diagrams over instances a–k omitted; the tabular panel lists, for each instance, the probability of membership in each of the three clusters:]

      1    2    3
a   0.4  0.1  0.5
b   0.1  0.8  0.1
c   0.3  0.3  0.4
d   0.1  0.1  0.8
e   0.4  0.2  0.4
f   0.1  0.4  0.5
g   0.7  0.2  0.1
h   0.5  0.4  0.1
…

Page 30: Data Mining – Output: Knowledge Representation

End Chapter 3