Decision Tree Algorithms: Rule-Based, Suitable for Automatic Generation



TRANSCRIPT

Page 1: Decision Tree Algorithms

• Rule based
• Suitable for automatic generation

Page 2: Decision trees

• Logical branching
• Historical: ID3, an early rule-generating system
• Branches: the different possible values of a variable
• Nodes: the points from which branches emanate
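As a sketch of this node-and-branch structure (Python; the Node class and the small example tree are illustrative, not the tree derived later in this chapter):

from dataclasses import dataclass, field

@dataclass
class Node:
    attribute: str                                  # variable tested at this node
    branches: dict = field(default_factory=dict)    # value -> subtree or leaf label

# Leaves are outcome labels; internal nodes test one attribute each.
toy_tree = Node("Risk", {
    "low": "OT",
    "medium": Node("Age", {"young": "OT", "middle": "Late", "old": "OT"}),
    "high": "Late",
})

def classify(node, case: dict) -> str:
    # Follow the branch matching the case's value until reaching a leaf.
    while isinstance(node, Node):
        node = node.branches[case[node.attribute]]
    return node

print(classify(toy_tree, {"Risk": "medium", "Age": "middle"}))  # Late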

Page 3: Goal-Driven Data Mining

• Define the goal: identify fraudulent cases
• Develop rules identifying attributes that attain the goal: IF attorney = Smith, THEN better check

Page 4: Tree Structure

• Sorts out data with IF-THEN rules
• Loan variables:
  – Age: {young, middle, old}
  – Income: {low, average, high}
  – Risk: {low, medium, high}
• An exhaustive tree enumerates all combinations: 3 × 3 × 3 = 27 combinations, classifying every case (enumerated in the sketch below)
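The combinatorics can be checked with a short Python sketch, using the value sets listed above:

from itertools import product

ages = ["young", "middle", "old"]
incomes = ["low", "average", "high"]
risks = ["low", "medium", "high"]

combos = list(product(ages, incomes, risks))
print(len(combos))   # 27 = 3 * 3 * 3
print(combos[0])     # ('young', 'low', 'low')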

Page 5: Types of Trees

• Classification tree: variable values are classes; finite conditions
• Regression tree: variable values are continuous numbers; prediction or estimation

Page 6: Rule Induction

• Automatically processes data
  – Classification (logical, easier)
  – Regression (estimation, messier)
• Searches through data for patterns and relationships: pure knowledge discovery
• Assumes no prior hypothesis
• Disregards human judgment

Page 7: Example

• Three variables: Age, Income, Risk
• Outcomes: On-time, Late

Page 8: Combinations

Variable   Value     Cases   OT   Late   Pr(OT)
Age        Young       12     8     4     0.67
           Middle       5     4     1     0.80
           Old          3     3     0     1.00
Income     Low          5     3     2     0.60
           Average      9     7     2     0.78
           High         6     5     1     0.83
Risk       High         9     5     4     0.55
           Average      1     0     1     0.00
           Low         10    10     0     1.00

Page 9: Basis for Classification

• If a category has all outcomes of one kind, that makes a good rule
  – e.g., in the table above, IF Risk = Low, they always paid on time
• ENTROPY: a measure of information content (actually a measure of randomness)

Page 10: Entropy formula

Information = -[p/(p+n)] log2[p/(p+n)] - [n/(p+n)] log2[n/(p+n)]

where p and n are the counts of positive (on-time) and negative (late) cases in a branch.

The lower the measure, the greater the information content.

Can be used to automatically select the variable with the most productive rule potential, as sketched below.
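A direct transcription of the formula (a minimal Python sketch; p and n are a branch's on-time and late counts, and 0 · log2(0) is treated as 0):

from math import log2

def information(p: int, n: int) -> float:
    # Entropy of a branch with p positive (on-time) and n negative (late) cases.
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c)

print(round(information(8, 4), 3))   # Young branch: 0.918
print(round(information(3, 0), 3))   # Old branch: 0.000 (pure, best rule potential)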

Page 11: Entropy

• Young: [-(8/12) log2(8/12) - (4/12) log2(4/12)] × 12/20 = 0.918 × 0.6 = 0.551
• Middle: [-(4/5) log2(4/5) - (1/5) log2(1/5)] × 5/20 = 0.722 × 0.25 = 0.180
• Old: [-(3/3) log2(3/3) - (0/3) log2(0/3)] × 3/20 = 0.000
• Sum for Age: 0.731
• Income: 0.782
• Risk: 0.446 (lowest entropy, so Risk gives the most productive split)
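These sums can be reproduced from the Combinations table on page 8; a minimal sketch that weights each branch's entropy by its share of the 20 cases:

from math import log2

def information(p, n):
    return sum(-c / (p + n) * log2(c / (p + n)) for c in (p, n) if c)

# (on-time, late) counts per value, from the table on page 8
counts = {
    "Age":    [(8, 4), (4, 1), (3, 0)],     # Young, Middle, Old
    "Income": [(3, 2), (7, 2), (5, 1)],     # Low, Average, High
    "Risk":   [(5, 4), (0, 1), (10, 0)],    # High, Average, Low
}

for var, branches in counts.items():
    total = sum(p + n for p, n in branches)   # 20 cases per variable
    weighted = sum((p + n) / total * information(p, n) for p, n in branches)
    print(var, round(weighted, 3))
# Age 0.731, Income 0.782, Risk 0.446 -> split on Risk first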

Page 12: Rule

1. IF Risk = Low THEN OT
2. ELSE Late

Page 13: All Rules

1. IF Risk = Low THEN OT
2. IF Risk NOT Low AND Age = Middle THEN Late
3. IF Risk NOT Low AND Age NOT Middle AND Income = High THEN Late
4. ELSE OT

Page 14: Sample Case

• Age 36 → Middle
• Income $70K/year → Average
• Risk: Assets $42K, Debts $40K, Wants $5K → Average
• Rule 2 applies, says Late
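The four rules from page 13, applied to this sample case, fit in a small Python sketch (the value coding is illustrative):

def classify(age: str, income: str, risk: str) -> str:
    if risk == "Low":        # Rule 1
        return "OT"
    if age == "Middle":      # Rule 2
        return "Late"
    if income == "High":     # Rule 3
        return "Late"
    return "OT"              # Rule 4 (ELSE)

# Sample case: age 36 -> Middle, $70K/year -> Average, risk -> Average
print(classify("Middle", "Average", "Average"))   # Rule 2 fires: Late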

Page 15: Fuzzy Decision Trees

• So far we have assumed distinct (crisp) outcomes
• Many data points are not that clear
• Fuzzy: a membership function represents belief (between 0 and 1)
• Fuzzy relationships have been incorporated into decision tree algorithms

Page 16: Fuzzy Example

Age:     Young 0.3   Middle 0.9    Old 0.2
Income:  Low 0.0     Average 0.8   High 0.3
Risk:    Low 0.1     Average 0.8   High 0.3

• Definitions:
  – The sum will not necessarily equal 1.0
  – If ambiguous, select the alternative with the larger membership value
  – Aggregate with the mean

Page 17: Fuzzy Model

• IF Risk = Low THEN OT
  – Membership function: 0.1
• IF Risk NOT Low AND Age = Middle THEN Late
  – Risk NOT Low: MAX(0.8, 0.3) = 0.8
  – Age Middle: 0.9
  – Membership function: mean = 0.85

Page 18: Fuzzy Model (cont.)

• IF Risk NOT Low AND Age NOT Middle AND Income = High THEN Late
  – Risk NOT Low: MAX(0.8, 0.3) = 0.8
  – Age NOT Middle: MAX(0.3, 0.2) = 0.3
  – Income High: 0.3
  – Membership function: mean = (0.8 + 0.3 + 0.3)/3 = 0.467

Page 19: Fuzzy Model (cont.)

• IF Risk NOT Low AND Age NOT Middle AND Income NOT High THEN OT
  – Risk NOT Low: MAX(0.8, 0.3) = 0.8
  – Age NOT Middle: MAX(0.3, 0.2) = 0.3
  – Income NOT High: MAX(0.0, 0.8) = 0.8
  – Membership function: mean = 0.633

Page 20: Fuzzy Model (cont.)

• Highest membership function is 0.633, for Rule 4

• Conclusion: On-time
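The whole fuzzy evaluation (pages 16-20) fits in a short Python sketch, taking NOT as the MAX over the remaining categories and aggregating conditions with the mean, as the slides specify:

from statistics import mean

# Membership values from the fuzzy example on page 16
m = {
    "Age":    {"Young": 0.3, "Middle": 0.9, "Old": 0.2},
    "Income": {"Low": 0.0, "Average": 0.8, "High": 0.3},
    "Risk":   {"Low": 0.1, "Average": 0.8, "High": 0.3},
}

def NOT(var, value):
    # 'NOT value' as the max membership over the remaining categories
    return max(v for k, v in m[var].items() if k != value)

rules = {
    "Rule 1 (OT)":   [m["Risk"]["Low"]],
    "Rule 2 (Late)": [NOT("Risk", "Low"), m["Age"]["Middle"]],
    "Rule 3 (Late)": [NOT("Risk", "Low"), NOT("Age", "Middle"), m["Income"]["High"]],
    "Rule 4 (OT)":   [NOT("Risk", "Low"), NOT("Age", "Middle"), NOT("Income", "High")],
}

for name, terms in rules.items():
    print(name, round(mean(terms), 3))   # 0.1, 0.85, 0.467, 0.633
# Rule 4 has the highest membership (0.633) -> conclude On-time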

Page 21: Applications

• Inventory Prediction
• Clinical Databases
• Software Development Quality

Page 22: Inventory Prediction

• Groceries
  – Maybe over 100,000 SKUs
  – Barcode data input
• Data mining to discover patterns
  – Random sample of over 1.6 million records, covering 30 months and 95 outlets
  – Test sample of 400,000 records
• Rule induction more workable than regression
  – 28,000 rules
  – Very accurate: up to 27% improvement

Page 23: Clinical Database

• Headache: over 60 possible causes
• Exclusive reasoning uses negative rules (used when a symptom is absent)
• Inclusive reasoning uses positive rules
• Probabilistic rule induction expert system
  – Headache: training sample of over 50,000 cases, 45 classes, 147 attributes
  – Meningitis: 1,200 samples on 41 attributes, 4 outputs

Page 24: Clinical Database (cont.)

• AQ15 and C4.5: average accuracy 82%
• Expert system: average accuracy 92%
• Rough set rule system: average accuracy 70%
• Both positive and negative rules from rough sets: average accuracy over 90%

Page 25: Software Development Quality

• Telecommunications company
• Goal: find patterns in modules being developed that are likely to contain faults discovered by customers
  – Typical module: several million lines of code
  – Probability of a fault averaged 0.074
• Apply greater effort to those modules: specification, testing, inspection

Page 26: Software Quality

• Preprocessed and reduced the data
• Used CART (Classification and Regression Trees)
  – Could specify prior probabilities
• First model: 9 rules, 6 variables
  – Better at cross-validation
  – But variable values not available until late
• Second model: 4 rules, 2 variables
  – About the same accuracy; data available earlier

Page 27: Decision Trees

• Very effective and useful
• Automatic machine learning
  – Thus unbiased (but omits judgment)
• Can handle very large data sets
  – Not affected much by missing data
• Lots of software available (one example sketched below)
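As one example of that software, scikit-learn includes a CART-style decision tree (the algorithm family used in the software quality study above). A minimal sketch; the tiny integer-coded data set here is illustrative, not the chapter's example:

from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: age, income, risk, each coded 0/1/2 (low/average/high)
X = [[0, 0, 2], [1, 1, 1], [2, 2, 0], [1, 2, 0], [0, 1, 2], [2, 0, 1]]
y = ["Late", "Late", "OT", "OT", "Late", "OT"]   # loan outcome labels

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(clf, feature_names=["age", "income", "risk"]))  # induced IF-THEN rules
print(clf.predict([[1, 1, 0]]))   # e.g. middle age, average income, low risk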