Data Mining with Decision Trees
Lutz Hamel, Dept. of Computer Science and Statistics
University of Rhode Island
What is Data Mining?
Data mining is the application of machine learning techniques to large databases in order to extract knowledge.
(KDD: Knowledge Discovery in Databases)
This is no longer strictly true; data mining now encompasses other computational techniques outside the classic machine learning domain.
What is Machine Learning?
Programs that get better with experience, given a task and some performance measure. Examples:
- Learning to classify news articles
- Learning to recognize spoken words
- Learning to play board games
- Learning to classify customers
What is Knowledge?
Structural descriptions of data (transparent):
- If-then-else rules
- Decision trees

Models of data (non-transparent):
- Neural Networks
- Clustering (Self-Organizing Maps, k-Means)
- Naïve-Bayes Classifiers
Why Data Mining?
Oversimplifying somewhat: queries allow you to retrieve existing knowledge from a database.

Data mining induces new knowledge in a database.
Why Data Mining? (Cont.)
Example: Give me a description of customers who spent more than $100 in my store.
Why Data Mining? (Cont.)
The Query: The only thing a query can do is give you a list of every single customer who spent more than $100.

This is probably not very informative, other than showing you a lot of customer records.
Why Data Mining? (Cont.)
Data Mining Techniques: Data mining techniques allow you to generate structural descriptions of the data in question, i.e., induce new knowledge. In the case of rules this might look something like:

IF age < 35 AND car = MINIVAN
THEN spent > $100
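A rule like this can be read as an executable predicate. The sketch below is only an illustration; the record fields and example customers are hypothetical, not from the slides.

```python
# A minimal sketch of applying the induced rule to customer records.
# Field names ("age", "car") are hypothetical illustrations.
def big_spender(customer):
    """Return True if the rule predicts the customer spent > $100."""
    return customer["age"] < 35 and customer["car"] == "MINIVAN"

customers = [
    {"age": 28, "car": "MINIVAN"},  # matches the rule
    {"age": 52, "car": "SEDAN"},    # does not match
]
print([big_spender(c) for c in customers])  # [True, False]
```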
Why Data Mining? (Cont.)
In principle, you could generate the same kind of knowledge you gain with data mining techniques using only queries:
- look at the data set of customers who spent more than $100 and propose a hypothesis
- test this hypothesis against your data using a query
- if the query returns a non-null result set, then you have found a description of a subset of your customers

This is a time-consuming, undirected search.
Decision Trees
Decision trees are concept learning algorithms. Once a concept is acquired, the algorithm can classify objects according to this concept.

Concept Learning:
- acquiring the definition of a general category given a sample of positive and negative examples of the category,
- can be formulated as a problem of searching through a predefined space of potential concepts for the concept that best fits the training examples.

Best known algorithms: ID3, C4.5, CART
Example
Below is a table of patients who entered the emergency room complaining about chest pain. There are two types of diagnoses: Angina and Myocardial Infarction (MI).

Systolic Blood Pressure | White Blood Count | Diagnosis
110 | 13000 | MI
90  | 12000 | MI
85  | 18000 | MI
120 |  8000 | MI
130 | 18000 | MI
180 |  5000 | Angina
200 |  7500 | Angina
165 |  6000 | Angina
190 |  6500 | Angina
120 |  9000 | Angina

(Source: Neural Networks and Artificial Intelligence for Biomedical Engineering, IEEE Press, 2000)

Question: can we generalize beyond this data?
Example (Cont.)

C4.5 induces the following decision tree for the data:

Systolic Blood Pressure
  <= 130 → Myocardial Infarction
  > 130  → Angina

[Figure: scatter plot of White Blood Count vs. Systolic Blood Pressure, with the MI and Angina cases separated by the decision surface at Systolic Blood Pressure = 130]
Definition of Concept Learning
Given:
- A data universe X
- A sample set S, where S ⊆ X
- Some target concept c: X → {true, false}
- Labeled training examples D, where D = { <s, c(s)> | s ∈ S }

Using D, determine:
- A function c′ such that c′(x) ≈ c(x) for all x ∈ X.

Notes: This is called supervised learning because of the necessity of labeled data provided by the trainer. Once we have determined c′ we can use it to make predictions on unseen elements of the data universe.
The Inductive Learning Hypothesis
Any function found to approximate the target concept well over a sufficiently large set of training examples will also approximate the target concept well over other unobserved examples.
In other words, we are able to generalize beyond what we have seen.
Recasting our Example as a Concept Learning Problem
- The data universe X consists of ordered pairs of the form (Systolic Blood Pressure, White Blood Count)
- The sample set S ⊆ X is the table of value pairs we are given
- Target concept: Diagnosis: X → {Angina, MI}
- The training examples D are given by the table, where D = { <s, Diagnosis(s)> | s ∈ S }

Find a function Diagnosis′ that best describes D.
Recasting our Example as a Concept Induction Problem
A definition of the learned function Diagnosis:
Diagnosis′(Systolic Blood Pressure, White Blood Count) =
  IF Systolic Blood Pressure > 130 THEN Diagnosis = Angina
  ELSE IF Systolic Blood Pressure <= 130 THEN Diagnosis = MI
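The learned function can be transcribed directly into code. A minimal sketch (note that the induced tree ignores the White Blood Count attribute entirely):

```python
# The learned function from the slide, written as a Python sketch.
def diagnosis(systolic_bp, white_blood_count):
    """Predict a diagnosis from the decision tree induced by C4.5.
    white_blood_count is unused: the tree splits only on blood pressure."""
    if systolic_bp > 130:
        return "Angina"
    else:
        return "MI"  # Myocardial Infarction

# Check against two rows of the training table:
# (110, 13000) was labeled MI, (180, 5000) was labeled Angina.
print(diagnosis(110, 13000))  # MI
print(diagnosis(180, 5000))   # Angina
```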
Decision Tree Representation
We can represent the learned function as a tree:
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification

Systolic Blood Pressure
  <= 130 → Myocardial Infarction
  > 130  → Angina
Entropy
- S is a sample of training examples
- p+ is the proportion of positive examples in S
- p− is the proportion of negative examples in S
- Entropy measures the impurity (randomness) of S

Entropy(S) ≡ −p+ log2 p+ − p− log2 p−

[Plot: Entropy(S) as a function of p+]
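The entropy formula translates into a few lines of code. A minimal sketch, using the convention that 0 · log2(0) = 0:

```python
import math

def entropy(pos, neg):
    """Entropy of a sample with pos positive and neg negative examples.
    By convention, 0 * log2(0) is taken to be 0."""
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:
            e -= p * math.log2(p)
    return e

print(entropy(5, 5))   # 1.0 (maximally impure: p+ = 0.5)
print(entropy(10, 0))  # 0.0 (pure sample)
```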
Top-down Induction of Decision Trees
Recursive Algorithm. Main loop:
1. Let A be the attribute that minimizes the entropy at the current node.
2. For each value of A, create a new descendant of the node.
3. Sort the training examples to the leaf nodes.
4. If the training examples are classified satisfactorily, STOP; else iterate over the new leaf nodes.
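The main loop above can be sketched as a recursive function for discrete-valued attributes. This is a simplified illustration, not ID3/C4.5 proper: the toy customer data is hypothetical, and tie-breaking, numeric attributes, and pruning are all omitted.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def build_tree(rows, labels, attributes):
    """rows: list of dicts; labels: parallel list of class labels."""
    if len(set(labels)) == 1:      # node is pure: make a leaf
        return labels[0]
    if not attributes:             # no attributes left: majority leaf
        return Counter(labels).most_common(1)[0][0]

    # pick the attribute whose split minimizes the weighted entropy
    def split_entropy(a):
        return sum(
            len(sub) / len(labels) * entropy(sub)
            for v in set(r[a] for r in rows)
            for sub in [[l for r, l in zip(rows, labels) if r[a] == v]]
        )
    best = min(attributes, key=split_entropy)

    # one descendant per value of the chosen attribute
    tree = {}
    for v in set(r[best] for r in rows):
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == v]
        tree[v] = build_tree(sub_rows, sub_labels,
                             [a for a in attributes if a != best])
    return (best, tree)

# Hypothetical toy data: does a customer spend > $100?
rows = [{"age": "young", "car": "minivan"},
        {"age": "young", "car": "sedan"},
        {"age": "old",   "car": "minivan"},
        {"age": "old",   "car": "sedan"}]
labels = ["yes", "yes", "no", "no"]
print(build_tree(rows, labels, ["age", "car"]))
# ('age', {'young': 'yes', 'old': 'no'})
```

Splitting on "age" yields two pure leaves (weighted entropy 0), while "car" leaves both branches maximally impure, so the algorithm chooses "age" and stops after one level.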
Information Gain
Gain(S, A) = expected reduction in entropy due to sorting on A:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

where S_v = { s ∈ S | A(s) = v }.

In other words, Gain(S, A) is the information provided about the target concept, given the value of some attribute A.
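Gain(S, A) can be computed directly from the definition. The sketch below discretizes the Systolic Blood Pressure attribute from the ER table at 130, mirroring the induced tree; representing attributes as parallel lists is an implementation choice for brevity.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(labels, attr_values):
    """Gain(S, A): labels and attr_values are parallel lists
    giving c(s) and A(s) for each example s in S."""
    total = len(labels)
    remainder = 0.0
    for v in set(attr_values):
        s_v = [l for l, a in zip(labels, attr_values) if a == v]
        remainder += len(s_v) / total * entropy(s_v)
    return entropy(labels) - remainder

# The ER table: 5 MI rows then 5 Angina rows; blood pressure <= 130
# for all MI rows and for the last Angina row (120).
labels = ["MI"] * 5 + ["Angina"] * 5
bp = ["<=130"] * 5 + [">130"] * 4 + ["<=130"]
print(gain(labels, bp))  # ≈ 0.61
```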
Training, Evaluation and Prediction
We know how to induce classification rules on the data, but:
How do we measure performance? How do we use our rules to do prediction?
Training & Evaluation

The simplest method of measuring performance is the hold-out method. Given labeled data D, we divide the data into two sets:
- A hold-out (test) set Dh of size h
- A training set Dt = D − Dh

The error of the induced function c′t is given as follows:

error_h = (1/h) Σ_{<s, c(s)> ∈ Dh} δ(c′t(s), c(s))

where δ(p, q) = 1 if p ≠ q and 0 otherwise.
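The hold-out error is just a misclassification count over the test set. A minimal sketch, reusing the tree induced earlier as the predictor; the hold-out pairs are hypothetical:

```python
def holdout_error(predict, test_set):
    """Fraction of hold-out examples the induced function misclassifies.
    test_set is a list of (instance, true_label) pairs."""
    h = len(test_set)
    return sum(1 for s, label in test_set if predict(s) != label) / h

# Predictor: the tree induced earlier (split at Systolic BP = 130).
predict = lambda bp: "MI" if bp <= 130 else "Angina"

# Hypothetical hold-out set of (systolic_bp, diagnosis) pairs.
holdout = [(100, "MI"), (95, "MI"), (170, "Angina"), (125, "Angina")]
print(holdout_error(predict, holdout))  # 0.25 (one of four misclassified)
```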
Training & Evaluation
However, since we trained and evaluated the learner on a finite set of data, we want to know what the confidence interval is.

We can compute the 95% confidence interval of error_h as follows:
- Assume that the hold-out set Dh has h ≥ 30 members.
- Assume that each d in Dh has been selected independently and according to the probability distribution over the domain.

Then:

error_h ± 1.96 · sqrt( error_h · (1 − error_h) / h )
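The interval can be computed directly from the formula. A sketch, using a hypothetical observed error of 0.10 on h = 100 hold-out examples:

```python
import math

def confidence_interval(error, h, z=1.96):
    """Two-sided confidence interval for a hold-out error estimate;
    z = 1.96 gives the 95% interval, valid for h >= 30."""
    margin = z * math.sqrt(error * (1 - error) / h)
    return error - margin, error + margin

lo, hi = confidence_interval(0.10, 100)
print((round(lo, 3), round(hi, 3)))  # (0.041, 0.159)
```

With 95% confidence, the true error of the hypothetical classifier lies roughly between 4% and 16%; quadrupling h would halve the margin.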
Prediction
As we have said earlier, the induced function c′ ≈ c; that is, the induced function is an estimate of the target concept.

Therefore, we can use c′ to estimate (predict) the label for any unseen instance x ∈ X with an appropriate accuracy.
Summary
Data Mining is the application of machine learning algorithms to large databases in order to induce new knowledge.
Machine Learning can be considered to be a directed search over the space of all possible descriptions of the training data for the best description of the data set that also generalizes well to unseen instances.
Decision trees are concept learning algorithms that learn classification functions.