Decision Tree Construction
The Software Infrastructure for Electronic Commerce
Databases and Data Mining
Lecture 4:
An Introduction To Data Mining (II)
Johannes [email protected]
http://www.cs.cornell.edu/johannes
Lectures Three and Four
• Data preprocessing
• Multidimensional data analysis
• Data mining:
  • Association rules
  • Classification trees
  • Clustering
Types of Attributes
• Numerical: Domain is ordered and can be represented on the real line (e.g., age, income)
• Nominal or categorical: Domain is a finite set without any natural ordering (e.g., occupation, marital status, race)
• Ordinal: Domain is ordered, but absolute differences between values are unknown (e.g., preference scale, severity of an injury)
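The three attribute types map naturally onto column types in a data-frame library. Below is a minimal sketch, assuming pandas is available; the column names and category values are illustrative and not taken from the lecture's datasets.

```python
# Minimal sketch (assumes pandas): declaring numerical, nominal, and ordinal columns.
import pandas as pd

df = pd.DataFrame({
    "age": [20, 30, 25],                               # numerical: ordered, real-valued
    "occupation": ["student", "teacher", "engineer"],  # nominal: finite set, no order
    "severity": ["minor", "severe", "moderate"],       # ordinal: ordered, gaps unknown
})

df["occupation"] = df["occupation"].astype("category")
df["severity"] = df["severity"].astype(
    pd.CategoricalDtype(categories=["minor", "moderate", "severe"], ordered=True)
)
print(df.dtypes)
```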
Classification
Goal: Learn a function that assigns a record to one of several predefined classes.
Classification Example
• Example training database
• Two predictor attributes: Age and Car-type (Sport, Minivan, and Truck)
• Age is ordered; Car-type is a categorical attribute
• Class label indicates whether the person bought the product
• Dependent attribute is categorical
| Age | Car | Class |
|-----|-----|-------|
| 20  | M   | Yes   |
| 30  | M   | Yes   |
| 25  | T   | No    |
| 30  | S   | Yes   |
| 40  | S   | Yes   |
| 20  | T   | No    |
| 30  | M   | Yes   |
| 25  | M   | Yes   |
| 40  | M   | Yes   |
| 20  | S   | No    |
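For reference in later sketches, the same training database can be written as plain Python records; this is just a transcription of the table above (M = Minivan, S = Sport, T = Truck).

```python
# The example training database as (age, car_type, class) records.
training_db = [
    (20, "M", "Yes"), (30, "M", "Yes"), (25, "T", "No"),  (30, "S", "Yes"),
    (40, "S", "Yes"), (20, "T", "No"),  (30, "M", "Yes"), (25, "M", "Yes"),
    (40, "M", "Yes"), (20, "S", "No"),
]
```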
Regression Example
• Example training database
• Two predictor attributes: Age and Car-type (Sport, Minivan, and Truck)
• Spent indicates how much person spent during a recent visit to the web site
• Dependent attribute is numerical
| Age | Car | Spent |
|-----|-----|-------|
| 20  | M   | $200  |
| 30  | M   | $150  |
| 25  | T   | $300  |
| 30  | S   | $220  |
| 40  | S   | $400  |
| 20  | T   | $80   |
| 30  | M   | $100  |
| 25  | M   | $125  |
| 40  | M   | $500  |
| 20  | S   | $420  |
Types of Variables (Review)
• Numerical: Domain is ordered and can be represented on the real line (e.g., age, income)
• Nominal or categorical: Domain is a finite set without any natural ordering (e.g., occupation, marital status, race)
• Ordinal: Domain is ordered, but absolute differences between values are unknown (e.g., preference scale, severity of an injury)
Definitions
• Random variables X1, …, Xk (predictor variables) and Y (dependent variable)
• Xi has domain dom(Xi); Y has domain dom(Y)
• P is a probability distribution on dom(X1) × … × dom(Xk) × dom(Y); the training database D is a random sample from P
• A predictor d is a function d: dom(X1) × … × dom(Xk) → dom(Y)
Classification Problem
• If Y is categorical, the problem is a classification problem, and we use C instead of Y; |dom(C)| = J.
• C is called the class label, and d is called a classifier.
• Let r be a record randomly drawn from P. Define the misclassification rate of d: RT(d,P) = P(d(r.X1, …, r.Xk) != r.C)
• Problem definition: Given dataset D that is a random sample from probability distribution P, find classifier d such that RT(d,P) is minimized.
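Since P is unknown in practice, RT(d, P) is estimated on a sample. The minimal sketch below reuses the `training_db` list from the classification example above; the classifier d ("Minivan buyers say Yes") is a hypothetical rule for illustration only.

```python
# Empirical misclassification rate of a classifier d on a sample of records.
def d(age, car_type):
    # Hypothetical classifier d: dom(Age) x dom(Car) -> dom(C).
    return "Yes" if car_type == "M" else "No"

def misclassification_rate(classifier, records):
    errors = sum(1 for age, car, label in records if classifier(age, car) != label)
    return errors / len(records)

print(misclassification_rate(d, training_db))  # fraction of misclassified records
```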
Regression Problem
• If Y is numerical, the problem is a regression problem.
• Y is called the dependent variable, d is called a regression function.
• Let r be a record randomly drawn from P. Define the mean squared error of d: RT(d,P) = E[(r.Y - d(r.X1, …, r.Xk))^2]
• Problem definition: Given dataset D that is a random sample from probability distribution P, find regression function d such that RT(d,P) is minimized.
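The regression analogue can be estimated the same way: average the squared residuals over a sample. A minimal sketch, using a few rows from the regression example table and a hypothetical piecewise-constant regression function (the predicted values are assumptions, not fitted):

```python
# Empirical mean squared error of a regression function on a sample.
def d_reg(age, car_type):
    return 300.0 if age >= 30 else 150.0  # hypothetical prediction of Spent

def mean_squared_error(reg_fn, records):
    # records are (age, car_type, spent) triples
    return sum((spent - reg_fn(age, car)) ** 2 for age, car, spent in records) / len(records)

sample = [(20, "M", 200), (30, "M", 150), (25, "T", 300), (40, "S", 400)]
print(mean_squared_error(d_reg, sample))
```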
Goals and Requirements
• Goals:
  • To produce an accurate classifier/regression function
  • To understand the structure of the problem
• Requirements on the model:
  • High accuracy
  • Understandable by humans, interpretable
  • Fast construction for very large training databases
Different Types of Classifiers
• Linear discriminant analysis (LDA)
• Quadratic discriminant analysis (QDA)
• Density estimation methods
• Nearest neighbor methods
• Logistic regression
• Neural networks
• Fuzzy set theory
• Decision trees
Difficulties with LDA and QDA
• Multivariate normal assumption often not true
• Not designed for categorical variables
• Form of classifier in terms of linear or quadratic discriminant functions is hard to interpret
Histogram Density Estimation
• Curse of dimensionality
• Cell boundaries are discontinuities; beyond the boundary cells, the estimate falls abruptly to zero
Kernel Density Estimation
• How to choose the kernel bandwidth h?
  • The optimal h depends on a criterion
  • The optimal h depends on the form of the kernel
  • The optimal h might depend on the class label
  • The optimal h might depend on the part of the predictor space
• How to choose the form of the kernel?
K-Nearest Neighbor Methods
• Difficulties:
  • Data must be stored; for classification of a new record, all data must be available
  • Computationally expensive in high dimensions
  • Choice of k is unknown
Difficulties with Logistic Regression
• Few goodness of fit and model selection techniques
• Categorical predictor variables have to be transformed into dummy vectors.
Neural Networks and Fuzzy Set Theory
Difficulties:
• Classifiers are hard to understand
• How to choose the network topology and initial weights?
• Categorical predictor variables?
What are Decision Trees?
[Figure: a decision tree with its root split on Age (<30 vs. >=30); for Age < 30 a second split on Car Type sends Minivan to YES and Sports/Truck to NO, while Age >= 30 leads to YES. A companion diagram shows the corresponding partition of the Age/Car Type predictor space into YES and NO regions.]
Decision Trees
• A decision tree T encodes d (a classifier or regression function) in the form of a tree.
• A node t in T without children is called a leaf node. Otherwise t is called an internal node.
Internal Nodes
• Each internal node has an associated splitting predicate. Most common are binary predicates.
• Example predicates:
  • Age <= 20
  • Profession in {student, teacher}
  • 5000*Age + 3*Salary – 10000 > 0
Internal Nodes: Splitting Predicates
• Binary univariate splits:
  • Numerical or ordered X: X <= c, c in dom(X)
  • Categorical X: X in A, A subset of dom(X)
• Binary multivariate splits:
  • Linear combination split on numerical variables: Σ ai·Xi <= c
• k-ary (k > 2) splits are analogous (the binary forms are sketched below)
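As a minimal sketch, each kind of splitting predicate can be viewed as a boolean function of a record; the thresholds and attribute sets below mirror the example predicates from the previous slide and are illustrative only.

```python
# Splitting predicates as boolean functions of a record (True = left branch).
def numeric_split(record):
    return record["Age"] <= 20                              # X <= c, numerical/ordered X

def categorical_split(record):
    return record["Profession"] in {"student", "teacher"}   # X in A, A subset of dom(X)

def linear_combination_split(record):
    return 5000 * record["Age"] + 3 * record["Salary"] - 10000 > 0   # sum(a_i * X_i) > c

r = {"Age": 19, "Profession": "student", "Salary": 20000}
print(numeric_split(r), categorical_split(r), linear_combination_split(r))
```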
Leaf Nodes
Consider a leaf node t:
• Classification problem: Node t is labeled with one class label c in dom(C)
• Regression problem: Two choices (both sketched below)
  • Piecewise constant model: t is labeled with a constant y in dom(Y)
  • Piecewise linear model: t is labeled with a linear model Y = yt + Σ ai·Xi
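A minimal sketch of the two regression leaf models: a piecewise constant leaf returns a single value, while a piecewise linear leaf evaluates yt + Σ ai·Xi. The constants and coefficients below are placeholders, not fitted values.

```python
# Leaf models for regression trees.
def constant_leaf(record, y):
    return y                                       # piecewise constant: a single value

def linear_leaf(record, y_t, coeffs):
    # piecewise linear: y_t + sum_i a_i * X_i, where coeffs maps attribute name -> a_i
    return y_t + sum(a * record[x] for x, a in coeffs.items())

r = {"Age": 35}
print(constant_leaf(r, y=250.0), linear_leaf(r, y_t=50.0, coeffs={"Age": 5.0}))
```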
Example
Encoded classifier:
If (age < 30 and carType = Minivan) Then YES
If (age < 30 and (carType = Sports or carType = Truck)) Then NO
If (age >= 30) Then YES
[Figure: the same decision tree as on Slide 19, splitting on Age (<30 vs. >=30) at the root and on Car Type (Minivan vs. Sports, Truck) below it.]
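A minimal sketch of the encoded classifier as a Python function, checked against the training database from the classification example (the data and car-type codes are transcribed from that slide).

```python
# The example decision tree written out as code, then applied to the training data.
def classify(age, car_type):
    if age < 30 and car_type == "Minivan":
        return "Yes"
    if age < 30 and car_type in ("Sports", "Truck"):
        return "No"
    return "Yes"  # age >= 30

training_db = [  # (age, car, class); M = Minivan, S = Sports, T = Truck
    (20, "M", "Yes"), (30, "M", "Yes"), (25, "T", "No"),  (30, "S", "Yes"),
    (40, "S", "Yes"), (20, "T", "No"),  (30, "M", "Yes"), (25, "M", "Yes"),
    (40, "M", "Yes"), (20, "S", "No"),
]
car_name = {"M": "Minivan", "S": "Sports", "T": "Truck"}
errors = sum(1 for age, car, label in training_db if classify(age, car_name[car]) != label)
print(f"misclassified records: {errors} / {len(training_db)}")  # 0 / 10 for this tree
```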
Choice of Classification Algorithm?
• Example study (Lim, Loh, and Shih, Machine Learning 2000):
  • 33 classification algorithms
  • 16 (small) data sets (UC Irvine ML Repository)
  • Each algorithm applied to each data set
• Experimental measurements:
  • Classification accuracy
  • Computational speed
  • Classifier complexity
Classification Algorithms
• Tree-structured classifiers:
  • IND, S-Plus Trees, C4.5, FACT, QUEST, CART, OC1, LMDT, CAL5, T1
• Statistical methods:
  • LDA, QDA, NN, LOG, FDA, PDA, MDA, POL
• Neural networks:
  • LVQ, RBF
Experimental Details
• 16 primary data sets; 16 more data sets created by adding noise
• Converted categorical predictor variables to 0-1 dummy variables if necessary
• Error rates for 6 data sets estimated from supplied test sets, 10-fold cross-validation used for the other data sets
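For the data sets without a supplied test set, the error estimate comes from 10-fold cross-validation. A minimal sketch of that procedure, assuming scikit-learn and synthetic stand-in data (the study's actual data sets and 33 algorithms are not reproduced here):

```python
# 10-fold cross-validated error estimate for one classifier on one data set.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # stand-in predictor variables
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # stand-in class label

accuracy = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("estimated error rate:", 1.0 - accuracy.mean())
```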
Ranking by Mean Error Rate
| Rank | Algorithm           | Mean Error | Time    |
|------|---------------------|------------|---------|
| 1    | Polyclass           | 0.195      | 3 hours |
| 2    | Quest Multivariate  | 0.202      | 4 min   |
| 3    | Logistic Regression | 0.204      | 4 min   |
| 6    | LDA                 | 0.208      | 10 s    |
| 8    | IND CART            | 0.215      | 47 s    |
| 12   | C4.5 Rules          | 0.220      | 20 s    |
| 16   | Quest Univariate    | 0.221      | 40 s    |
| …    |                     |            |         |
Other Results
• Number of leaves for tree-based classifiers varied widely (median number of leaves between 5 and 32, after removing some outliers)
• Mean misclassification rates for the top 26 algorithms are not statistically significantly different; the bottom 7 algorithms have significantly higher error rates
Decision Trees: Summary
• Powerful data mining model for classification (and regression) problems
• Easy to understand and to present to non-specialists
• TIPS:
  • Even if black-box models sometimes give higher accuracy, construct a decision tree anyway
  • Construct decision trees with different splitting variables at the root of the tree
Clustering
• Input: Relational database with fixed schema
• Output: k groups of records called clusters, such that the records within a group are more similar to each other than to records in other groups
• More difficult than classification (unsupervised learning: no record labels are given)
• Usage:
  • Exploratory data mining
  • Preprocessing step (e.g., outlier detection)
Clustering (Contd.)
• In clustering, we partition a set of records into meaningful sub-classes called clusters.
• Cluster: a collection of data objects that are “similar” to one another and thus can be treated collectively as one group.
• Clustering helps users to detect inherent groupings and structure in a data set.
Clustering (Contd.)
• Example input database: Two numerical variables
• How many groups are here?
• Requirements: Need to define “similarity” between records
| Age | Salary |
|-----|--------|
| 20  | 40     |
| 25  | 50     |
| 24  | 45     |
| 23  | 50     |
| 40  | 80     |
| 45  | 85     |
| 42  | 87     |
| 35  | 82     |
| 70  | 30     |
Graphical Representation
[Figure: "Customer Demographics" scatter plot of the customers, with Age on the x-axis (0 to 80) and Salary on the y-axis (0 to 100).]
Clustering (Contd.)
• Output of clustering:
  • Representative points for each cluster
  • Labeling of each record with its cluster number
  • Other descriptions of each cluster
• Important: Use the "right" distance function (see the sketch below)
  • Scale or normalize all attributes (example: seconds, hours, days)
  • Assign different weights according to the importance of each attribute
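A minimal sketch of those two points: min-max scale each attribute to [0, 1] so that the choice of units (seconds vs. hours vs. days) does not dominate the distance, then use a weighted Euclidean distance. The weights are illustrative assumptions.

```python
# Min-max scaling plus a weighted Euclidean distance between two records.
def min_max_scale(rows):
    lo = [min(col) for col in zip(*rows)]
    hi = [max(col) for col in zip(*rows)]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]

def weighted_distance(a, b, weights):
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)) ** 0.5

customers = [[20, 40], [25, 50], [40, 80], [70, 30]]   # a few (Age, Salary) records
scaled = min_max_scale(customers)
print(weighted_distance(scaled[0], scaled[1], weights=[1.0, 2.0]))
```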
Clustering: Summary
• Finding natural groups in data
• Common post-processing steps:
  • Build a decision tree with the cluster label as the class label (sketched below)
  • Try to explain the groups using the decision tree
  • Visualize the clusters
  • Examine the differences between the clusters with respect to the fields of the dataset
  • Try different numbers of clusters
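A minimal sketch of the first two post-processing steps, assuming scikit-learn: cluster the example Age/Salary records, then fit a small decision tree that predicts the cluster label, so the tree's splits describe the groups. The choice of k = 3 is an assumption for this toy data.

```python
# Cluster the records, then explain the clusters with a decision tree.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[20, 40], [25, 50], [24, 45], [23, 50], [40, 80],
              [45, 85], [42, 87], [35, 82], [70, 30]], dtype=float)   # (Age, Salary)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels)
print(export_text(tree, feature_names=["Age", "Salary"]))
```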
Web Usage Mining
• Data sources:
  • Web server log
  • Information about the web site:
    • Site graph
    • Metadata about each page (type, objects shown)
    • Object concept hierarchies
• Preprocessing:
  • Detect session and user context (cookies, user authentication, personalization)
Web Usage Mining (Contd.)
• Data mining:
  • Association rules
  • Sequential patterns
  • Classification
• Action:
  • Personalized pages
  • Cross-selling
• Evaluation and measurement:
  • Deploy personalized pages selectively
  • Measure the effectiveness of each implemented action
Large Case Study: Churn
• Telecommunications industry
• Try to predict churn (whether a customer will switch long-distance carriers)
• Dataset:
  • 5000 records (tiny dataset, but manageable here in class)
  • 21 attributes, both numerical and categorical (very few attributes)
  • Data is already cleaned! No missing values, inconsistencies, etc. (again, for classroom purposes)
Churn Example: Dataset Columns
• State
• Account length: Number of months the customer has been with the company
• Area code
• Phone number
• International plan: yes/no
• Voice mail: yes/no
• Number of voice messages: Average number of voice messages per day
• Total (day, evening, night, international) minutes: Average number of minutes charged
• Total (day, evening, night, international) calls: Average number of calls made
• Total (day, evening, night, international) charge: Average amount charged per day
• Number of customer service calls: Number of calls made to customer support in the last six months
• Churned: Did the customer switch long-distance carriers in the last six months?
Churn Example: Analysis
• We start out by getting familiar with the dataset:
  • Record viewer
  • Statistics visualization
  • Evidence classifier
  • Visualizing joint distributions
  • Visualizing the geographic distribution of churn
Churn Example: Analysis (Contd.)
• Building and interpreting data mining models:
  • Decision trees
  • Clustering
Evaluating Data Mining Tools
• Checklist:
  • Integration with current applications and your data management infrastructure
  • Ease of use
  • Automation
  • Scalability to large datasets:
    • Number of records
    • Number of attributes
    • Datasets larger than main memory
    • Support of sampling
  • Export of models into your enterprise
  • Stability of the company that offers the product
Integration With Data Management
• Proprietary storage format?
• Native support of major database systems:
  • IBM DB2, Informix, Oracle, SQL Server, Sybase
  • ODBC
• Support of parallel database systems
• Integration with your data warehouse
Cost Considerations
• Proprietary or commodity hardware and operating system:
  • Client and server might be different
  • What server platforms are supported?
• Support staff needed
• Training of your staff members:
  • Online training, tutorials
  • On-site training
  • Books, course material
Data Mining Projects
• Checklist:
  • Start with well-defined business questions
  • Have a champion within the company
  • Define measures of success and failure
• Main difficulty: No automation
  • Understanding the business problem
  • Selecting the relevant data
  • Data transformation
  • Selection of the right mining methods
  • Interpretation
Understand the Business Problem
Important questions:
• What is the problem that we need to solve?
• Are there certain aspects of the problem that are especially interesting?
• Do we need data mining to solve the problem?
• What information is actionable, and when?
• Are there important business rules that constrain our solution?
• What people should we keep in the loop, and with whom should we discuss intermediate results?
• Who are the (internal) customers of the effort?
Hiring Outside Experts?
Factors:
• One-time problem versus ongoing process
• Source of data
• Deployment of data mining models
• Availability and skills of your own staff
Hiring Experts
Types of experts:
• Your software vendor
• Consulting companies/centers/individuals
Your goal: Develop in-house expertise
The Data Mining Market
• Revenues for the data mining market: $8 billion (Mega Group 1/1999)
• Sales of data mining software (Two Crows Corporation 6/99):
  • 1998: $50 million
  • 1999: $75 million
  • 2000: $120 million
• Hardware companies often use their data mining software as loss-leaders (Examples: IBM, SGI)
Knowledge Management in General
Percent of information technology executives citing the systems used in their knowledge management strategy (IW 4/1999)
• Relational Database: 95%
• Text/Document Search: 80%
• Groupware: 71%
• Data Warehouse: 65%
• Data Mining Tools: 58%
• Expert Database/AI Tools: 25%
Crossing the Chasm
• Data mining is currently trying to cross this chasm.
• Great opportunities, but also great perils.
• You have a unique advantage by applying data mining “the right way”.
• It is not yet common knowledge how to apply data mining “the right way”.
• No major cooking recipes to make a data mining project work (yet).
Summary
• Database and data mining technology is crucial for any enterprise
• We talked about the complete data management infrastructure:
  • DBMS technology
  • Querying
  • WWW/DBMS integration
  • Data warehousing and dimensional modeling
  • OLAP
  • Data mining
Additional Material: Web Sites
• Data mining companies, jobs, courses, publications, datasets, etc.: www.kdnuggets.com
• ACM Special Interest Group on Knowledge Discovery and Data Mining: www.acm.org/sigkdd
Additional Material: Books
• U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996
• Michael Berry & Gordon Linoff, Data Mining Techniques for Marketing, Sales and Customer Support, John Wiley & Sons, 1997.
• Ian Witten and Eibe Frank, Data Mining, Practical Machine Learning Tools and Techniques with Java Implementations, Oct 1999
• Michael Berry & Gordon Linoff, Mastering Data Mining, John Wiley & Sons, 2000.
Additional Material: Database Systems
• IBM DB2: www.ibm.com/software/data/db2
• Oracle: www.oracle.com
• Sybase: www.sybase.com
• Informix: www.informix.com
• Microsoft: www.microsoft.com/sql
• NCR Teradata: www.ncr.com/product/teradata
Questions?
“Prediction is very difficult, especially about the future.”
Niels Bohr