
Page 1: Visualization for  Classification and Clustering Techniques

Visualization for Classification and Clustering Techniques

Marc René
CSE 8331 Data Mining - Project 1

Page 2: Visualization for  Classification and Clustering Techniques

Overview

Importance of Data Visualization in the KDD Process

Understanding and Trust

Visualization techniques

– Classification
– Clustering

Future Directions

Page 3: Visualization for  Classification and Clustering Techniques

KDD Process

Selection
– Obtain data from all sources

Preprocessing
– After selecting the data, clean it to make sure it is consistent

Transformation
– After preprocessing the data, analyze the format/amount of data

Data Mining
– Once the data is in a usable format/content, apply various algorithms based upon the results trying to be achieved

Interpretation/Evaluation
– Finally, present the results of the data mining step to the user, so that the results can be used to solve the business need at hand

Page 4: Visualization for  Classification and Clustering Techniques

Importance of Data Visualization

It is important to consider that the KDD process is useless if the results are not understandable

The final step in the KDD process:

– Highly dependent on the Data Visualization technique
– Bad/inappropriate technique may result in misunderstanding
– Misunderstanding may cause an incorrect (or no) decision

Page 5: Visualization for  Classification and Clustering Techniques

Current Issues w/Data Visualization

The literature suggests a significant reliance on expert users

General lack of data visualization support in many data mining tools [Goebel99]

These are significant problems if KDD/DM/Data Visualization expand at the rates suggested

– Data visualization tool market: $2.2 billion by 2007 [Nuttall03]

Page 6: Visualization for  Classification and Clustering Techniques

Suggested Direction

Need to determine techniques that balance simplicity with completeness

If this can be done for non-expert users

– Simplicity & Completeness → Understanding
– Understanding → Trust
– Trust → more use of KDD/DM
– Result will be: better business value, higher ROI

Page 7: Visualization for  Classification and Clustering Techniques

Common Visualization Techniques

Visualization techniques dependent upon

– The type of data mining technique chosen
– The underlying structure and attributes of the data

Classification

– Decision Trees
– Scatter Plots
– Axis-Parallel Decision Trees
– Circle Segments
– Decision Tables

Clustering

– Scatter Plots
– Dendrograms
– Smoothed Data Histograms
– Self-Organizing Maps
– Proximity Matrixes

Page 8: Visualization for  Classification and Clustering Techniques


Classification

Page 9: Visualization for  Classification and Clustering Techniques

Decision Tree

Information limited to

– Attributes
– Splitting values
– Terminal node class assignments

Page 10: Visualization for  Classification and Clustering Techniques

Decision Tree with Histograms

Data mining rarely classifies 100% of the data correctly:

– Include the success of properly classifying the data - histogram added for each terminal node

– Percentage of data that was classified correctly/incorrectly

– Assists users in determining if the classification is ‘good enough’
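The per-node histogram described above can be sketched in a few lines. This is only an illustration under stated assumptions: the leaf identifiers, the sample records, and the helper name `leaf_histograms` are hypothetical, and each terminal node is assumed to predict its majority class.

```python
from collections import Counter, defaultdict


def leaf_histograms(leaf_ids, true_labels):
    """Return {leaf: (predicted_class, pct_correct, pct_incorrect)}."""
    by_leaf = defaultdict(list)
    for leaf, label in zip(leaf_ids, true_labels):
        by_leaf[leaf].append(label)

    result = {}
    for leaf, labels in by_leaf.items():
        counts = Counter(labels)
        # Assumption: a terminal node predicts its majority class.
        predicted, hits = counts.most_common(1)[0]
        pct_correct = 100.0 * hits / len(labels)
        result[leaf] = (predicted, pct_correct, 100.0 - pct_correct)
    return result


# Hypothetical data: the terminal node each record reached and its true class.
leaf_ids = ["A", "A", "A", "A", "B", "B", "B", "B", "B"]
true_labels = ["yes", "yes", "yes", "no", "no", "no", "no", "no", "yes"]

hist = leaf_histograms(leaf_ids, true_labels)
for leaf, (predicted, ok, bad) in sorted(hist.items()):
    filled = round(ok / 10)
    bar = "#" * filled + "." * (10 - filled)
    print(f"leaf {leaf}: predicts {predicted!r} [{bar}] {ok:.0f}% correct, {bad:.0f}% incorrect")
```

A user can then compare the percentage shown at each terminal node with whatever threshold counts as 'good enough' for the business need.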

Page 11: Visualization for  Classification and Clustering Techniques


Decision Tree - Different Format

Vertical representation - allows for easy user interaction

– Combines the split points and classification accuracy - compactly

– Key difference - colors are matched with a specific classification

Page 12: Visualization for  Classification and Clustering Techniques


Scatter Plot with Regression Line

Excellent way to view 2-dimensional data

Familiar to anyone who has taken high-school algebra

Regression lines provide descriptive techniques for classification
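As a rough illustration of the regression line on a 2-dimensional scatter plot, the sketch below fits an ordinary least-squares line and reports which side of the line each point falls on. The sample points and the function names `fit_line` and `side_of_line` are hypothetical, and using the side of the line as a class description is an assumption about how the line would be read.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def side_of_line(x, y, slope, intercept):
    """Return 'above' or 'below'; points exactly on the line count as 'below'."""
    return "above" if y > slope * x + intercept else "below"


# Hypothetical 2-dimensional data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
for x, y in zip(xs, ys):
    print((x, y), side_of_line(x, y, slope, intercept))
```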

Page 13: Visualization for  Classification and Clustering Techniques

Axis-Parallel Decision Tree

Combination Scatter Plot and Decision Tree

Plot area divided into regions by lines parallel to the axes

Well suited for classification problems with two attribute values

High visibility into the impact of outliers
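One way to see why the regions are axis-parallel: every split tests a single attribute against a constant, so each terminal node corresponds to a rectangle on the scatter plot. The tree, thresholds, and class labels below are hypothetical, and the tuple encoding of nodes is an assumption made for this sketch.

```python
import math

# A node is either a class label (leaf) or a tuple:
# (attribute_index, threshold, subtree_if_lte, subtree_if_gt)
TREE = (0, 5.0, (1, 2.0, "A", "B"), "C")


def classify(node, point):
    """Walk the tree; each test compares one attribute with a constant."""
    while not isinstance(node, str):
        attr, threshold, left, right = node
        node = left if point[attr] <= threshold else right
    return node


def regions(node, bounds=None):
    """Yield (class_label, [(x_min, x_max), (y_min, y_max)]) for every leaf."""
    if bounds is None:
        bounds = [(-math.inf, math.inf), (-math.inf, math.inf)]
    if isinstance(node, str):
        yield node, bounds
        return
    attr, threshold, left, right = node
    low, high = bounds[attr]
    left_bounds = list(bounds)
    left_bounds[attr] = (low, min(high, threshold))
    right_bounds = list(bounds)
    right_bounds[attr] = (max(low, threshold), high)
    yield from regions(left, left_bounds)
    yield from regions(right, right_bounds)


for label, box in regions(TREE):
    print(label, box)
print(classify(TREE, (1.0, 3.0)))
```

Drawing each rectangle over the scatter plot makes an outlier visible as a point whose class does not match the region it sits in.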

Page 14: Visualization for  Classification and Clustering Techniques

Circle Segments

Multi-dimension data

Maps dataset with n dimensions onto a circle divided by n segments

– Each segment is a different attribute

– Each pixel inside a segment is a single value of the attribute

– Values of each attribute are then sorted (independently) and assigned a different color based upon their class
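The layout steps above can be sketched as follows. This is a simplified model that only computes the angular span of each segment and the color sequence inside it; it does not place individual pixels. The records, class names, palette, and function names are hypothetical.

```python
import math


def segment_angles(n_attributes):
    """Divide the circle into n equal segments, one per attribute."""
    step = 2 * math.pi / n_attributes
    return [(i * step, (i + 1) * step) for i in range(n_attributes)]


def segment_colors(records, classes, palette):
    """For each attribute, sort its values independently and
    return the class color of every value in sorted order."""
    n_attributes = len(records[0])
    layout = []
    for a in range(n_attributes):
        order = sorted(range(len(records)), key=lambda r: records[r][a])
        layout.append([palette[classes[r]] for r in order])
    return layout


# Hypothetical dataset with two attributes and two classes.
records = [(5.1, 0.2), (6.3, 1.8), (4.9, 1.5)]
classes = ["s", "v", "v"]
palette = {"s": "red", "v": "blue"}

angles = segment_angles(len(records[0]))
colors = segment_colors(records, classes, palette)
for (start, end), row in zip(angles, colors):
    print(f"segment {start:.2f}..{end:.2f} rad: {row}")
```

An attribute that separates the classes well shows long runs of a single color in its segment; a mixed sequence suggests the attribute is a weak discriminator.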

Page 15: Visualization for  Classification and Clustering Techniques

Decision Table

Interactive technique

Maps attribute data to a 2D hierarchical matrix

Levels can be drilled down - another set of attributes

Height of a cell conveys the number of data entities

Cells color coded

– Neutral color: no data in that intersection point
– Color coded by class (percentage)
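A single level of such a table can be sketched as a cross-tabulation of two attributes. Each cell keeps the number of data entities (the cell height) and the class percentages (the color coding). The attribute names, records, and the function name `decision_table` are hypothetical; drill-down would repeat the same step on the records of one cell with another pair of attributes.

```python
from collections import Counter, defaultdict


def decision_table(records, row_attr, col_attr, class_attr):
    """Return {(row_value, col_value): {'count': n, 'class_pct': {...}}}.
    Intersections with no data are simply absent (the 'neutral' cells)."""
    cells = defaultdict(Counter)
    for rec in records:
        cells[(rec[row_attr], rec[col_attr])][rec[class_attr]] += 1

    table = {}
    for key, counts in cells.items():
        total = sum(counts.values())
        table[key] = {
            "count": total,
            "class_pct": {c: 100.0 * n / total for c, n in counts.items()},
        }
    return table


# Hypothetical records.
records = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]

table = decision_table(records, "outlook", "windy", "play")
for key in sorted(table):
    print(key, table[key])
```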

Page 16: Visualization for  Classification and Clustering Techniques


Decision Table

Page 17: Visualization for  Classification and Clustering Techniques


Clustering

Page 18: Visualization for  Classification and Clustering Techniques

Scatter Plot

Extensions include displaying points in:

– Various sizes and colors to indicate additional attributes
– Shading of points to introduce a third dimension
– Using different brightness levels of the same color to represent continuous values for the same attribute
– Using various points or classification identifiers (e.g., numbers, symbols)
– Using various glyphs to display additional attributes

Page 19: Visualization for  Classification and Clustering Techniques


Scatter Plot

Map decision trees on top of scatter plots to describe clusters

Page 20: Visualization for  Classification and Clustering Techniques


Scatter Plot with Regression Lines

Page 21: Visualization for  Classification and Clustering Techniques


Scatter Plot w/Min Spanning Tree

Page 22: Visualization for  Classification and Clustering Techniques

Dendrogram

Intuitive representation - hierarchical decomposition of data into sets of nested clusters.

From an agglomerative perspective:

– Each leaf - a single data entity
– Each internal node - the union of all data entities in its sub-tree
– The root - the entire dataset
– The height of any internal node - the similarity between its ‘children’.
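The agglomerative reading of a dendrogram can be sketched with a small single-linkage routine. Each merge becomes an internal node whose height is the distance at which its two children were joined. Treating "similarity" as a distance, using single linkage, and working on 1-dimensional points are all assumptions of this sketch, and the data is hypothetical.

```python
def single_linkage(points):
    """Agglomerative clustering of 1-D points.
    Returns the merges in order as (member_indices, height)."""
    clusters = [(i,) for i in range(len(points))]  # every leaf is one data entity
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(points[a] - points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = tuple(sorted(clusters[i] + clusters[j]))  # union of both sub-trees
        merges.append((merged, d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


points = [1.0, 2.0, 6.0, 7.5]  # hypothetical data
merges = single_linkage(points)
for members, height in merges:
    print(f"node {members} at height {height}")
```

The last merge is the root and contains every index, which matches the statement that the root represents the entire dataset.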

Page 23: Visualization for  Classification and Clustering Techniques

Dendrogram with Exemplars

The “most typical member of each cluster” [Wishart99]

– Underlined labels of the leaves

– Done in combination with shading to identify the clustering level
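One plausible way to pick a "most typical member" is the medoid: the member with the smallest total distance to the other members of its cluster. This is an assumption made for illustration and is not necessarily the definition used in [Wishart99]; the cluster data and the function name `exemplar` are hypothetical.

```python
import math


def exemplar(cluster):
    """Return the medoid: the member closest, in total, to all other members."""
    return min(cluster, key=lambda p: sum(math.dist(p, q) for q in cluster))


# Hypothetical cluster of 2-D points.
cluster = [(1, 1), (2, 2), (3, 1), (2, 1), (2, 0)]
print(exemplar(cluster))
```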

Page 24: Visualization for  Classification and Clustering Techniques

Smoothed Data Histogram

Represents data on a ‘display map’

Similar data items are located close to each other

The more defined the clusters, the lighter the colors
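The slides do not spell out how the histogram is computed. The sketch below follows a simplified reading of the voting scheme in the smoothed data histogram paper listed in the references: each data item gives s points to its closest map unit, s - 1 to the second closest, and so on. Treat the details as an assumption. The 1-dimensional map units, the data, and the function name are hypothetical.

```python
def smoothed_data_histogram(data, units, s=3):
    """Each data item votes for its s closest map units with weights s, s-1, ..., 1."""
    votes = [0] * len(units)
    for item in data:
        ranked = sorted(range(len(units)), key=lambda u: abs(item - units[u]))
        for rank, u in enumerate(ranked[:s]):
            votes[u] += s - rank
    return votes


# Hypothetical map with four units (1-D model vectors) and three data items.
units = [0.0, 1.0, 2.0, 3.0]
data = [0.1, 0.2, 2.9]

votes = smoothed_data_histogram(data, units, s=2)
print(votes)
```

Units with many votes would be drawn in lighter colors on the display map, so dense areas of the data stand out as cluster centers.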

Page 25: Visualization for  Classification and Clustering Techniques


Smoothed Data Histogram - Detail

Page 26: Visualization for  Classification and Clustering Techniques


Self-Organizing Map ‘Grid’

Source of Smoothed Data Histogram

Numbers indicate most ‘common’ cluster

1         5  

2 3 2   5 6 5

2 2 2 4 5 5 5

7 1 1 1 5 7  

7 8 7 7 7 10 7

7 9 7 7   11 7

  8     7 10 7

Page 27: Visualization for  Classification and Clustering Techniques


Proximity Matrix

Graphically display the relationship between data elements

Usually symmetric, but can be sorted by the strength of relationships
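A minimal sketch of a proximity matrix and of reordering it so that strongly related elements sit next to each other. Using absolute difference on 1-dimensional values as the proximity measure, and sorting by value as the reordering rule, are assumptions made for illustration; the data and function names are hypothetical.

```python
def proximity_matrix(points):
    """Pairwise distances; the result is symmetric with a zero diagonal."""
    return [[abs(p - q) for q in points] for p in points]


def reorder(matrix, order):
    """Apply the same permutation to rows and columns."""
    return [[matrix[i][j] for j in order] for i in order]


points = [5.0, 1.0, 4.0, 1.5]  # hypothetical data elements
matrix = proximity_matrix(points)

# Place similar elements next to each other (for 1-D data, sort by value).
order = sorted(range(len(points)), key=lambda i: points[i])
reordered = reorder(matrix, order)

for row in reordered:
    print(row)
```

After reordering, small distances gather near the diagonal, so clusters appear as blocks when the matrix is drawn with color or shading.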

Page 28: Visualization for  Classification and Clustering Techniques


Proximity Matrix and Dendrogram

Page 29: Visualization for  Classification and Clustering Techniques

Summary

Data visualization techniques are extremely important for understanding the KDD process

A balance of simplicity and completeness is important

The techniques discussed allow average users to understand the results of the KDD process

Understanding allows KDD results to be interpreted/trusted by non-expert users, extending the business value

If data visualization techniques do not establish a high level of trust in the KDD process, the process will fail

Page 30: Visualization for  Classification and Clustering Techniques

Future Direction

Significant effort will be spent on improving data visualization techniques in the next few years

– KDD process and data mining are becoming more widespread
– Business will expect tools to become more ‘user-friendly’ and support the varied level of skills

Trends are moving to a more interactive mode

– Static reporting techniques (e.g., standard decision trees, standard circle segments) are being replaced
– Interactive techniques (e.g., smoothed data histograms, interactive circle segments and decision tables) are taking their place

Very interactive data models (‘virtual reality’) are also being considered/proposed

Page 31: Visualization for  Classification and Clustering Techniques

References Part 1

Ahlberg, C., “Spotfire: An Information Exploration Environment”, ACM SIGMOD Record, Volume 25, Number 4, December 1996
Ankerst, M., et al., “Visual Classification: An Interactive Approach to Decision Tree Construction”, KDD-99, San Diego, CA
Ankerst, M., et al., “Towards an Effective Cooperation of the User and the Computer for Classification”, KDD’00, Boston, MA, USA
Apte C. and Weiss S.M., “Data Mining with Decision Trees and Decision Rules”, Future Generation Computer Systems, November 1997
Arkin, E., et al., “Decision Trees for Geometric Models”, ACM, 9th Annual Computational Geometry, 5/93/CA, USA
de Hann, G., et al., “Towards Intuitive Exploration Tools for Data Visualization in VR”, VRST’02, November 11-13, 2003, Hong Kong
Dunham, M. H., Data Mining – Introductory and Advanced Topics, Prentice Hall, 2003.
Fekete, J. and Plaisant, C., Excentric Labeling: Dynamic Neighborhood Labeling for Data Visualization, Proceedings of the Conference on Human factors in Computer Systems (CHI'99), ACM, New York
Fredrikson, A., et al., “Temporal, Geographical and Categorical Aggregations Viewed through Coordinated Displays: A Case Study with Highway Incident Data”, NPIVM’99, Kansas City, MO, 1999
Goebel, M. and Gruenwald, L., “A Survey of Data Mining and Knowledge Discovery Software Tools”, SIGKDD Explorations, June 1999.
Han, J. and Cersone, N., “RuleViz: A Model for Visualizing Knowledge Discovery Process”, Sixth International Conference on Knowledge Discovery and Data Mining, 2000
Ho, T., et al., “Visualization Support for a User-Centered KDD Process”, SIGKDD’02, 2002.

Page 32: Visualization for  Classification and Clustering Techniques

References Part 2

Hsieh, H. and Shipman, F. M. III, “VITE: A Visual Interface Supporting the Direct Manipulation of Structured Data Using Two-Way Mappings”, IUI 2000, New Orleans LA
“Solving Business Problems with IBM DB2 Intelligent Miner”, Presented by DB2 Developer Domain, http://www7b.software.ibm.com/dmdd
Jain, A. K., et al., “Data Clustering: A Review”, ACM Computing Surveys, Volume 31, Number 3, September 1999
Keim, D. A., “Visual Techniques for Exploring Databases”, KDD’97, Newport Beach, CA, 1997
Kohavi, R., and Sommerfield, D., “Targeting Business Users with Decision Table Classifiers”, KDD’98, New York City, 1998
Kohavi, R., et al., “Emerging Trends in Business Analytics”, Communications of the ACM, Volume 45, Number 8, August 2002
Liu, B., et al., “Clustering Through Decision Tree Construction”, CIKM 2000, ACM, McLean VA, 2000
Louie, J. Q. and Kraay, T., “Origami: A New Visualization Tool”, KDD-99, San Diego, CA
Moret, B. M. E., “Decision Trees and Diagrams”, Computing Surveys, Volume 14, Number 4, December 1982
Nuttall, C., “It's a Vision Thing”, Financial Times-IT Review, November 12, 2003
Pampalk, E., et al., “Using Smoothed Data Histograms for Cluster Visualization in Self-Organizing Maps”, Proceedings of the International Conference on Artificial Neural Networks (ICANN’02), Springer Lecture Notes in Computer Science, Madrid Spain, 2002
Pampalk, E., et al., “Content-based Organization and Visualization of Music Archives”, Proceedings of the 10th ACM International Conference on Multimedia (MM’02), Juan-les-Pins, France, 2002
Pampalk, E., et al., “A New Approach to Hierarchical Clustering and Structuring of Data with Self-Organizing Maps”, Intelligent Data Analysis Journal (IDA), Volume 8, Number 2, 2003

Page 33: Visualization for  Classification and Clustering Techniques

References Part 3

Rauber, A., et al., “Empirical Evaluation of Clustering Algorithms”, Journal of Information and Organizational Sciences (JIOS), Volume 24, Number 2, 2000
“Finding the Solution to Data Mining – Exploring the Features and Components of Enterprise Miner, Release 4.1 from SAS”, SAS White Paper, 2001
See5 - Data Mining Tools, Release 1.9, Rulequest Research 1997-2003
Simoff, S. J., “VDM@ECML/PKDD2001: The International Workshop on Visual Data Mining at ECML/PKDD 2001”, SIGKDD Explorations, Volume 3, Issue 2, 2001
Thearling, K., “Understanding Data Mining: It’s All in the Interaction”, DS Star: The On-Line Executive Journal for Data-Intensive Decision Support, Volume 1, Number 10, December 9, 1997
Thearling, K., et al., “Visualizing Data Mining Models”, as published in Information Visualization in Data Mining and Knowledge Discovery, edited by Fayyad, Usama, et al., Morgan Kaufman, 2001
Ward, M. O., “XmdvTool: Integrating Multiple Methods for Visualizing Multivariate Data”, Proceedings of IEEE Visualization '94 (Washington, DC, 1994).
Wishart, D., “Efficient Hierarchical Cluster Analysis for Data Mining and Knowledge Discovery”, Computing Science and Statistics, Volume 30, 1998.
Wishart, D., “ClustanGraphics3 – Interactive Graphics for Cluster Analysis”, Published in: Classification in the Information Age, Gaul W. and Locarrek-Junge, H (Eds.), Springer 1999
XmdvTool Home Page (http://davis.wpi.edu/~xmdv/visualizations.html)