Data Mining with Decision Trees
Lutz Hamel
Dept. of Computer Science and Statistics
University of Rhode Island
What is Data Mining?
Data mining is the application of machine learning techniques to large databases in order to extract knowledge.
(KDD Knowledge Discovery in Databases)
This is no longer strictly true: data mining now encompasses other computational techniques outside the classic machine learning domain.
What is Machine Learning?
Programs that get better with experience, given a task and some performance measure. Examples:
- Learning to classify news articles
- Learning to recognize spoken words
- Learning to play board games
- Learning to classify customers
What is Knowledge?
Structural descriptions of data (transparent):
- If-then-else rules
- Decision trees
Models of data (non-transparent):
- Neural networks
- Clustering (self-organizing maps, k-means)
- Naïve Bayes classifiers
Why Data Mining?
Oversimplifying somewhat:
Queries allow you to retrieve existing knowledge from a database.
Data mining induces new knowledge in a database.
Why Data Mining? (Cont.)
Example: Give me a description of customers who spent more than $100 in my store.
Why Data Mining? (Cont.)
The query: the only thing a query can do is give you a list of every single customer who spent more than $100.
This is probably not very informative, except that you will most likely see a lot of customer records.
Why Data Mining? (Cont.)
Data Mining Techniques: Data mining techniques allow you to
generate structural descriptions of the data in question, i.e., induce new knowledge.
In the case of rules this might look something like:
IF age < 35 AND car = MINIVAN
THEN spent > $100
Why Data Mining? (Cont.)
In principle, you could generate the same kind of knowledge you gain with data mining techniques using only queries:
- look at the data set of customers who spent more than $100 and propose a hypothesis
- test this hypothesis against your data using a query
- if the query returns a non-null result set, then you have found a description of a subset of your customers
This is a time-consuming, undirected search.
Decision Trees
Decision trees are concept learning algorithms. Once a concept is acquired, the algorithm can classify objects according to this concept.
Concept learning:
- acquiring the definition of a general category given a sample of positive and negative examples of the category;
- can be formulated as a problem of searching through a predefined space of potential concepts for the concept that best fits the training examples.
Best known algorithms: ID3, C4.5, CART
Example
Below is a table of patients who entered the emergency room complaining about chest pain, with two types of diagnoses: Angina and Myocardial Infarction (MI).

Systolic Blood Pressure | White Blood Count | Diagnosis
110 | 13000 | MI
 90 | 12000 | MI
 85 | 18000 | MI
120 |  8000 | MI
130 | 18000 | MI
180 |  5000 | Angina
200 |  7500 | Angina
165 |  6000 | Angina
190 |  6500 | Angina
120 |  9000 | Angina

(Source: Neural Networks and Artificial Intelligence for Biomedical Engineering, IEEE Press, 2000)
Question: can we generalize beyond this data?
Example (Contd)
C4.5 induces the following decision tree for the data:
[Figure: Decision surface. Scatter plot of White Blood Count (0-20000) versus Systolic Blood Pressure (0-250), with points labeled MI and Angina; the decision surface separates the two classes at Systolic Blood Pressure = 130.]
Systolic Blood Pressure
  <= 130 : Myocardial Infarction
  >  130 : Angina
Definition of Concept Learning
Given:
- A data universe X
- A sample set S, where S ⊆ X
- Some target concept c: X → {true, false}
- Labeled training examples D, where D = { ⟨s, c(s)⟩ | s ∈ S }
Using D, determine:
- A function ĉ such that ĉ(x) ≈ c(x) for all x ∈ X.
Notes: This is called supervised learning because of the necessity of labeled data provided by the trainer. Once we have determined ĉ, we can use it to make predictions on unseen elements of the data universe.
The Inductive Learning Hypothesis
Any function found to approximate the target concept well over a sufficiently large set of training examples will also approximate the target concept well over other unobserved examples.
In other words, we are able to generalize beyond what we have seen.
Recasting our Example as a Concept Learning Problem
- The data universe X consists of ordered pairs of the form (Systolic Blood Pressure, White Blood Count).
- The sample set S ⊆ X is the table of value pairs we are given.
- The target concept is Diagnosis: X → {Angina, MI}.
- The training examples D are the table, where D = { ⟨s, Diagnosis(s)⟩ | s ∈ S }.
Find a function Diagnosis that best describes D.
Recasting our Example as a Concept Induction Problem
A definition of the learned function Diagnosis:
Diagnosis(Systolic Blood Pressure, White Blood Count) =
  IF Systolic Blood Pressure > 130 THEN Diagnosis = Angina
  ELSE IF Systolic Blood Pressure <= 130 THEN Diagnosis = MI
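The learned function is simple enough to write down directly as code. A minimal sketch in Python (the function and argument names are my own, not part of the original slides):

```python
def diagnosis(systolic_bp, white_blood_count):
    """Learned classifier from the example: only systolic blood
    pressure turned out to be needed for the split."""
    if systolic_bp > 130:
        return "Angina"
    return "MI"

# Classifying two of the training patients:
print(diagnosis(110, 13000))  # MI
print(diagnosis(180, 5000))   # Angina
```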
Decision Tree Representation
We can represent the learned function as a tree:
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification
Systolic Blood Pressure
  <= 130 : Myocardial Infarction
  >  130 : Angina
Entropy
- S is a sample of training examples
- p+ is the proportion of positive examples in S
- p- is the proportion of negative examples in S
- Entropy measures the impurity (randomness) of S

Entropy(S) ≡ -p+ log2 p+ - p- log2 p-

[Figure: Entropy(S) plotted as a function of p+.]
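As a sketch, the two-class entropy formula can be computed like this in Python (the convention 0 log 0 = 0 is handled explicitly):

```python
from math import log2

def entropy(pos, neg):
    """Entropy of a sample with `pos` positive and `neg` negative
    examples; the 0 log 0 = 0 convention is applied by skipping
    empty classes."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:
            result -= p * log2(p)
    return result

print(entropy(5, 5))   # 1.0 (maximally impure)
print(entropy(10, 0))  # 0.0 (pure sample)
```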
Top-down Induction of Decision Trees
Recursive algorithm. Main loop:
1. Let attribute A be the attribute that minimizes the entropy at the current node.
2. For each value of A, create a new descendant of the node.
3. Sort the training examples to the leaf nodes.
4. If the training examples are classified satisfactorily, then STOP; else iterate over the new leaf nodes.
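The main loop can be sketched as a recursive Python function. Here `choose` stands in for the attribute-selection step (e.g. picking the attribute that minimizes entropy), and the dictionary-based tree and row representation are my own illustrative choices, not part of the original algorithm description:

```python
from collections import Counter

def majority(labels):
    """Most common label -- used for leaf nodes."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, attributes, target, choose):
    """Top-down induction: split on the attribute picked by
    `choose(rows, attributes)`, recurse on each subset, and stop
    when a node is pure or no attributes remain."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1 or not attributes:
        return majority(labels)  # leaf: classification
    a = choose(rows, attributes)
    tree = {a: {}}
    for v in {row[a] for row in rows}:  # one branch per value of a
        subset = [row for row in rows if row[a] == v]
        rest = [b for b in attributes if b != a]
        tree[a][v] = build_tree(subset, rest, target, choose)
    return tree

# Toy run with a trivial selection rule (take the first attribute):
rows = [{"car": "minivan", "spent100": "yes"},
        {"car": "sports", "spent100": "no"},
        {"car": "minivan", "spent100": "yes"}]
print(build_tree(rows, ["car"], "spent100", lambda r, a: a[0]))
```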
Information Gain
Gain(S, A) = expected reduction in entropy due to sorting on A.
Gain(S, A) ≡ Entropy(S) - Σ_{v ∈ Values(A)} ( |Sv| / |S| ) Entropy(Sv)

where Sv = { s ∈ S | A(s) = v }
In other words, Gain(S, A) is the information provided about the target concept, given the value of some attribute A.
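A direct Python transcription of Gain(S, A), a sketch using rows represented as dictionaries (an illustrative encoding, not from the slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels; absent classes simply
    contribute nothing (0 log 0 = 0)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def gain(rows, attribute, target):
    """Gain(S, A): entropy of S minus the size-weighted entropy
    of each subset S_v obtained by splitting on attribute A."""
    labels = [row[target] for row in rows]
    g = entropy(labels)
    for v in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == v]
        g -= len(subset) / len(rows) * entropy(subset)
    return g

# Splitting on "car" separates the classes perfectly, so the gain
# equals the full entropy of the sample (1 bit):
rows = [{"car": "minivan", "spent100": "yes"},
        {"car": "minivan", "spent100": "yes"},
        {"car": "sports", "spent100": "no"},
        {"car": "sports", "spent100": "no"}]
print(gain(rows, "car", "spent100"))  # 1.0
```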
Training, Evaluation and Prediction
We know how to induce classification rules on the data, but:
- How do we measure performance?
- How do we use our rules to do prediction?
Training & Evaluation
The simplest method of measuring performance is the hold-out method. Given labeled data D, we divide the data into two sets:
- A hold-out (test) set Dh of size h,
- A training set Dt = D \ Dh.
The error of the induced function ĉt is given as follows:

error_h ≡ (1/h) Σ_{⟨s, c(s)⟩ ∈ Dh} δ(ĉt(s), c(s))

where δ(p, q) = 1 if p ≠ q and 0 otherwise.
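The hold-out error is just a counting loop. A sketch, with a hypothetical one-attribute classifier standing in for the induced function:

```python
def holdout_error(classifier, test_set):
    """Fraction of hold-out examples (x, label) the classifier
    gets wrong -- the error_h estimate."""
    wrong = sum(1 for x, label in test_set if classifier(x) != label)
    return wrong / len(test_set)

# Hypothetical classifier from the blood-pressure example, applied
# to three held-out patients (one of whom it misclassifies):
classify = lambda sbp: "Angina" if sbp > 130 else "MI"
test = [(110, "MI"), (180, "Angina"), (120, "Angina")]
print(holdout_error(classify, test))  # 1/3
```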
Training & Evaluation
However, since we trained and evaluated the learner on a finite set of data, we want to know what the confidence interval is.
We can compute the 95% confidence interval of error_h as follows. Assume:
- the hold-out set Dh has h ≥ 30 members,
- each d in Dh has been selected independently and according to the probability distribution over the domain.
Then the interval is:

error_h ± 1.96 √( error_h (1 - error_h) / h )
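The interval is straightforward to compute; a sketch of the formula above:

```python
from math import sqrt

def confidence_interval_95(error, h):
    """95% confidence interval for the true error, given the
    hold-out error on h >= 30 examples (normal approximation)."""
    margin = 1.96 * sqrt(error * (1 - error) / h)
    return error - margin, error + margin

# A 10% hold-out error on 100 test examples:
lo, hi = confidence_interval_95(0.10, 100)
print(f"{lo:.3f} .. {hi:.3f}")  # 0.041 .. 0.159
```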
Prediction
As we have said earlier, ĉ ≈ c; that is, the induced function is an estimate of the target concept.
Therefore, we can use ĉ to estimate (predict) the label for any unseen instance x ∈ X with an appropriate accuracy.
Summary
Data Mining is the application of machine learning algorithms to large databases in order to induce new knowledge.
Machine Learning can be considered to be a directed search over the space of all possible descriptions of the training data for the best description of the data set that also generalizes well to unseen instances.
Decision trees are concept learning algorithms that learn classification functions.