
Page 1: Analytical and Visual Data Mining

University of Illinois at Urbana-Champaign

Michael Welge
welge@ncsa.uiuc.edu
Automated Learning Group, NCSA
www.ncsa.uiuc.edu/STI/ALG
October 14, 1998

Page 2: Why Data Mining? -- Potential Applications

• Database analysis, decision support, and automation
– Market and Sales Analysis
– Fraud Detection
– Manufacturing Process Analysis
– Risk Analysis and Management
– Experimental Results Analysis
– Scientific Data Analysis
– Text Document Analysis

Page 3: Data Mining: Confluence of Multiple Disciplines

• Database Systems, Data Warehouses, and OLAP
• Machine Learning
• Statistics
• Mathematical Programming
• Visualization
• High Performance Computing

Page 4: Data Mining: On What Kind of Data?

• Relational Databases
• Data Warehouses
• Transactional Databases
• Advanced Database Systems
– Object-Relational
– Spatial
– Temporal
– Text
– Heterogeneous, Legacy, and Distributed
– WWW

Page 5: Why Do We Need Data Mining?

• Leverage the organization’s data assets
– Only a small portion (typically 5%-10%) of the collected data is ever analyzed
– Data that may never be analyzed continues to be collected, at great expense, out of fear that something which may prove important in the future would otherwise be missed
– Data growth rates preclude the traditional “manual intensive” approach

Page 6: Why Do We Need Data Mining?

• As databases grow, supporting the decision support process with traditional query languages becomes infeasible
– Many queries of interest are difficult to state in a query language (the query formulation problem):
  – “find all cases of fraud”
  – “find all individuals likely to buy a FORD Expedition”
  – “find all documents that are similar to this customer’s problem”

Page 7: Knowledge Discovery Process

• Data Mining: a step in the knowledge discovery process consisting of particular algorithms (methods) that, under some acceptable objective, produce a particular enumeration of patterns (models) over the data.

• Knowledge Discovery Process: the process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations.

Page 8: Data Mining: A KDD Process

Page 9: Knowledge Discovery Process Application Domain

First and foremost you must understand your data and your business.

It may be that you wish to increase the response from a direct mail campaign. Do you want to build a model to:
– increase the response rate, or
– increase the value of the response?

Depending on your specific goal, the model you choose may be different.
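To make the difference between the two goals concrete, here is a minimal sketch in Python. The three prospects, their response probabilities, and their order values are invented for illustration and do not come from the talk; the point is only that ranking by response rate and ranking by expected value can produce different mailing lists.

```python
# Hypothetical prospects: every number here is made up for illustration.
prospects = [
    {"name": "A", "p_respond": 0.10, "order_value": 20.0},
    {"name": "B", "p_respond": 0.04, "order_value": 200.0},
    {"name": "C", "p_respond": 0.07, "order_value": 50.0},
]

# Goal 1: increase the response rate -> rank by probability of responding.
by_rate = sorted(prospects, key=lambda p: p["p_respond"], reverse=True)

# Goal 2: increase the value of the response -> rank by expected value
# (probability of responding times the value of the order).
by_value = sorted(
    prospects, key=lambda p: p["p_respond"] * p["order_value"], reverse=True
)

print("ranked by response rate: ", [p["name"] for p in by_rate])
print("ranked by expected value:", [p["name"] for p in by_value])
```

With these made-up figures the two rankings are exact reverses of each other, so the target variable has to be settled before any model is built.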

Page 10: Knowledge Discovery - Selecting Data

The task of selecting data begins with deciding what data is needed to solve the problem.

Issues:
– Database incompatibility
– Data may be in an obscure form
– Data is incomplete

Page 11: Knowledge Discovery - Preparing The Data

Data may have to be loaded from legacy systems or external sources, stored, cleaned, and validated.

Issues:
– Data may be in a format incompatible with its end use
– Data may have many missing, incomplete, or erroneous values
– Field descriptions may be unclear, confusing, or have different meanings depending on the source
– Data may be stale

Page 12: Knowledge Discovery - Transforming Data

Considerable planning and knowledge of your data should go into the transformation decision.

Data transformations are at the heart of developing a sound model.

Page 13: Knowledge Discovery - Types of Transformations

• Feature construction
– applying a mathematical formula to existing data features

• Feature subset selection
– removing columns that are not pertinent, are redundant, or contain uninteresting predictors

• Aggregating data
– grouping features together and finding sums, maximums, minimums, or averages

• Bin the data
– breaking up continuous ranges into discrete segments
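The four transformation types can be sketched in a few lines of plain Python. This is an illustrative sketch only: the customer records, the field names (`income`, `debt`, `fax`), and the bin edges are all hypothetical and are not taken from the talk.

```python
from collections import defaultdict

# Hypothetical customer records; every field name and value is invented.
records = [
    {"id": 1, "region": "east", "income": 42000.0, "debt": 8400.0, "fax": "n/a"},
    {"id": 2, "region": "west", "income": 58000.0, "debt": 29000.0, "fax": "n/a"},
    {"id": 3, "region": "east", "income": 91000.0, "debt": 9100.0, "fax": "n/a"},
]

def construct_feature(rows):
    """Feature construction: derive a debt-to-income ratio from two existing features."""
    return [{**r, "debt_ratio": r["debt"] / r["income"]} for r in rows]

def select_features(rows, keep):
    """Feature subset selection: keep only the named columns."""
    return [{k: r[k] for k in keep} for r in rows]

def aggregate(rows, group_key, value_key):
    """Aggregation: sum, maximum, minimum, and average of value_key per group."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[group_key]].append(r[value_key])
    return {
        g: {"sum": sum(v), "max": max(v), "min": min(v), "avg": sum(v) / len(v)}
        for g, v in groups.items()
    }

def bin_value(x, edges, labels):
    """Binning: map a continuous value to a discrete label.

    edges must be ascending and labels must have one more entry than edges.
    """
    for edge, label in zip(edges, labels):
        if x < edge:
            return label
    return labels[-1]

rows = construct_feature(records)
rows = select_features(rows, ["id", "region", "income", "debt_ratio"])  # drops "fax"
income_by_region = aggregate(records, "region", "income")
income_bins = [bin_value(r["income"], [50000.0, 80000.0], ["low", "mid", "high"])
               for r in records]
```

Here the constant `fax` column plays the role of an uninteresting predictor, and the two bin edges split income into three discrete segments.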

Page 14: Knowledge Discovery - Data Mining

The process of building models differs among:
– Supervised learning (classification, regression, time series problems)
– Unsupervised learning (database segmentation)
– Pattern identification and description (link analysis)

Once you have decided on the model type, you must choose a method for building the model (decision tree, neural net, K-nearest neighbor), then the algorithm (backpropagation).
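As an illustration of one of the methods named above, the following is a minimal K-nearest neighbor classifier for a supervised learning problem. The training points, labels, and the choice of `k=3` are assumptions made for the example, not material from the talk.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify query by majority vote among its k nearest training points.

    train is a list of (features, label) pairs; features are numeric tuples.
    """
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled examples: two numeric features and a yes/no response.
train = [
    ((1.0, 1.0), "no"), ((1.2, 0.8), "no"), ((0.9, 1.1), "no"),
    ((5.0, 5.0), "yes"), ((5.2, 4.9), "yes"), ((4.8, 5.1), "yes"),
]

print(knn_classify(train, (1.1, 1.0)))
print(knn_classify(train, (5.0, 4.8)))
```

K-nearest neighbor stores the examples and defers all work to query time, so unlike a neural net it has no separate training algorithm such as backpropagation to select.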

Page 15: Knowledge Discovery - Analyze and Deploy

Once the model is built, its implications must be understood. Graphical representations of relationships between independent and dependent variables may be necessary. Also, attention should be focused on important aspects of the model such as outliers or value.

Model deployment may mean writing a new application, embedding into an existing system, or applying it to an existing data set. Model monitoring should be established.

Page 16: Required Effort for Each KDD Step

[Bar chart: Effort (%) on a 0-60 scale for each KDD step: Business Objectives Determination, Data Preparation, Data Mining, Analysis & Assimilation]

Page 17: What Data Mining Will Not Do

• Automatically find answers to questions you do not ask

• Constantly monitor your database for new and interesting relationships

• Eliminate the need to understand your business and your data

• Remove the need for good data analysis skills

Page 18: Data Mining Models and Methods

• Predictive Modeling
– Classification
– Value prediction

• Database Segmentation
– Demographic clustering
– Neural clustering

• Link Analysis
– Associations discovery
– Sequential pattern discovery
– Similar time sequence discovery

• Deviation Detection
– Visualization
– Statistics

Page 19: Deviation Detection

• identify outliers in a dataset
• typical techniques - probability distribution contrasts, supervised/unsupervised learning
• hypothetical example: Point-of-sale fraud detection
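One simple statistical way to flag outliers is a z-score test, sketched below. The override counts and the threshold of 2.0 standard deviations are invented for illustration; the talk does not say which technique was actually used.

```python
import statistics

def find_outliers(values, threshold=2.0):
    """Return (index, value) pairs lying more than threshold standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, so nothing deviates
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mean) / stdev > threshold]

# Hypothetical monthly override counts, one entry per sales associate.
overrides = [4, 5, 6, 5, 4, 6, 5, 30]

print(find_outliers(overrides))
```

A single extreme value inflates both the mean and the standard deviation, so a production system would more likely use a robust statistic such as the median; the sketch only shows the basic idea of contrasting one record against the overall distribution.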

Page 20: Fraud and Inappropriate Practice Prevention

Background: Through regular review, HR has developed a collaborative relationship with its Sales Associates (SAs). Semi-annual meetings allow review of the SAs’ practices with similar SAs across the country.

Goal: The approach is aimed at modifying SAs’ behavior to promote better service rather than at investigating and prosecuting SAs, although both strategies are employed.

Page 21: Fraud and Inappropriate Practice Prevention

Business Objective: The focus of this project was on the recent and steady 12% annual rise in overrides. The overall business objective of the project was to find a way to ensure that the overrides were appropriate without negatively affecting the service provided by the SAs.

Page 22: Fraud and Inappropriate Practice Prevention

Approach:

• To identify potentially fraudulent overrides or overrides arising from inappropriate practices.

• To develop general profiles of the SAs’ practices in order to compare the practice behavior of individual SAs.

Page 23: Fraud and Inappropriate Practice Prevention

Page 24: Database Segmentation

• regroup datasets into clusters that share common characteristics

• typical technique - unsupervised learning (SOMs, K-Means)

• hypothetical example: Cluster all similar regimes (financial, free form text)
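K-Means, one of the two techniques named above, can be sketched in plain Python as follows. The points are invented, and seeding the centroids with the first k points is a simplification chosen to keep the example deterministic; practical implementations usually use random or smarter initialization.

```python
import math

def k_means(points, k, iterations=20):
    """Cluster numeric tuples into k groups; returns (centroids, clusters)."""
    centroids = [points[i] for i in range(k)]  # naive seeding: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical two-dimensional records forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.2, 0.8), (8.5, 9.0)]
centroids, clusters = k_means(points, k=2)
```

No labels are supplied anywhere, which is what makes this unsupervised: the groups emerge from the distances between records alone.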

Page 25: Self Organizing Maps Example - Text Clustering

This data is considered to be confidential and proprietary to Caterpillar and may only be used with prior written consent from Caterpillar.

Page 26: Predictive Modeling

• past data predicts future response
• typical technique - supervised learning (Artificial Neural Networks, Decision Trees, Naïve Bayesian)
• hypothetical example (classification): Who is most likely to respond to a direct mailing?
• hypothetical example (prediction): How will the German Stock Price Index perform in the next 3, 5, or 7 days?
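The direct-mailing classification example can be sketched with a small Naïve Bayes classifier, which starts from the prior probability of each class and updates it to a posterior given the observed attributes. The six training rows, the two attributes, and the use of add-one (Laplace) smoothing are assumptions made for this sketch.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    priors = Counter(labels)
    counts = defaultdict(Counter)  # (label, attribute index) -> value counts
    values = defaultdict(set)      # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            counts[(label, i)][v] += 1
            values[i].add(v)
    return priors, counts, values

def posterior(model, row):
    """Return P(label | row) for every label, assuming independent attributes."""
    priors, counts, values = model
    total = sum(priors.values())
    scores = {}
    for label, n in priors.items():
        score = n / total  # prior probability of the class
        for i, v in enumerate(row):
            # Add-one smoothing avoids zero probabilities for unseen values.
            score *= (counts[(label, i)][v] + 1) / (n + len(values[i]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Hypothetical past mailing: (housing, income band) and whether the person responded.
rows = [("owner", "high"), ("owner", "low"), ("renter", "low"),
        ("renter", "low"), ("owner", "high"), ("renter", "high")]
labels = ["yes", "no", "no", "no", "yes", "no"]

model = train_naive_bayes(rows, labels)
post = posterior(model, ("owner", "high"))
```

In this made-up data only one person in three responded, so the prior favors "no", yet for a high-income owner the posterior favors "yes": the observed attributes outweigh the prior.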

Page 27: Predictive Modeling - Prior Probabilities

Page 28: Predictive Modeling - Posterior Probabilities

Page 29: Link Analysis

• relationships between records/attributes in datasets

• typical techniques - rule association, sequence discovery

• hypothetical example (rule association): When people buy a hammer they also buy nails 50% of the time

• hypothetical example (sequence discovery): When people buy a hammer they also buy nails within the next 3 months 18% of the time, and within the subsequent 3 months 12% of the time

Page 30: Link Analysis (Rule Association)

• Given a database, find all associations of the form:

IF <LHS> THEN <RHS>

Prevalence = frequency of the LHS and RHS occurring together

Predictability = fraction of the records containing the LHS that also contain the RHS
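Both measures can be computed directly from a list of transactions, as in the sketch below. The baskets are invented, and the function name `rule_metrics` is hypothetical. Prevalence and predictability correspond to what are now more commonly called support and confidence.

```python
def rule_metrics(transactions, lhs, rhs):
    """Return (prevalence, predictability) for the rule IF lhs THEN rhs.

    transactions is a list of sets of items; lhs and rhs are collections of items.
    """
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for t in transactions if lhs <= t)           # baskets containing the LHS
    n_both = sum(1 for t in transactions if (lhs | rhs) <= t)  # baskets containing LHS and RHS
    prevalence = n_both / len(transactions)
    predictability = n_both / n_lhs if n_lhs else 0.0
    return prevalence, predictability

# Hypothetical hardware-store baskets.
baskets = [
    {"hammer", "nails"},
    {"hammer"},
    {"nails", "glue"},
    {"hammer", "nails", "glue"},
    {"hammer", "tape"},
    {"glue"},
]

prevalence, predictability = rule_metrics(baskets, ["hammer"], ["nails"])
print(f"prevalence = {prevalence:.2%}, predictability = {predictability:.2%}")
```

In this toy data four baskets contain a hammer and two of those also contain nails, so the rule has a predictability of 50% and a prevalence of 2 out of 6 baskets.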

Page 31: Rule Association - Basket Analysis

Page 32: Association Rules - Basket Analysis

• <Dairy-Milk-Refrigerated> implies <Soft Drinks Carbonated>
– prevalence = 4.99%, predictability = 22.89%

• <Dry Dinners - Pasta> implies <Soup-Canned>
– prevalence = 0.94%, predictability = 28.14%

• <Paper Towels - Jumbo> implies <Toilet Tissue>
– prevalence = 2.11%, predictability = 38.22%

• <Dry Dinners - Pasta> implies <Cereal - Ready to Eat>
– prevalence = 1.36%, predictability = 41.02%

• <Cheese-Processed Slices - American> implies <Cereal - Ready to Eat>
– prevalence = 1.16%, predictability = 38.01%
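If predictability is read as prevalence divided by the frequency of the LHS, as the rule association definitions suggest, the two reported figures imply how often each LHS appears at all. The short calculation below makes that explicit; the function name is hypothetical and the results are only as precise as the rounded percentages on the slide.

```python
# Figures copied from the slide: (LHS, RHS, prevalence %, predictability %).
rules = [
    ("Dairy-Milk-Refrigerated", "Soft Drinks Carbonated", 4.99, 22.89),
    ("Dry Dinners - Pasta", "Soup-Canned", 0.94, 28.14),
    ("Paper Towels - Jumbo", "Toilet Tissue", 2.11, 38.22),
    ("Dry Dinners - Pasta", "Cereal - Ready to Eat", 1.36, 41.02),
    ("Cheese-Processed Slices - American", "Cereal - Ready to Eat", 1.16, 38.01),
]

def implied_lhs_frequency(prevalence_pct, predictability_pct):
    """Percent of baskets containing the LHS, assuming predictability = prevalence / P(LHS)."""
    return 100.0 * prevalence_pct / predictability_pct

for lhs, rhs, prev, pred in rules:
    freq = implied_lhs_frequency(prev, pred)
    print(f"{lhs} -> {rhs}: LHS in about {freq:.2f}% of baskets")
```

As a consistency check, the two rules that share the LHS Dry Dinners - Pasta both imply that it appears in roughly 3.3% of baskets.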

Page 33: Requirements For Successful Data Mining

• There is a sponsor for the application.
• The business case for the application is clearly understood and measurable, and the objectives are likely to be achievable given the resources being applied.
• The application has a high likelihood of having a significant impact on the business.
• Business domain knowledge is available.
• Good quality, relevant data in sufficient quantities is available.

Page 34: Requirements For Successful Data Mining

• The right people – business domain, data management, and data mining experts. People who have “been there and done that”

For a first time project the following criteria could be added:

• The scope of the application is limited. Try to show results within 3-6 months.

• The data sources should be limited to those that are well known, relatively clean, and freely accessible.

Page 35: Rapid KD Development Environment

Page 36: Rapid KDD Development Environment

Page 37: Why Information Visualization

• Gain insight into the contents and complexity of the database being analyzed
• Vast amounts of underutilized data
• Time-critical decisions hampered
• Key information difficult to find
• Results presentation
• Reduced perceptual, interpretative, cognitive burden

Page 38: Typical Data

• Abstract corporate data
• Mostly discrete, not continuous
• Often multi-dimensional
• Quantitative
• Text
• Historical or real-time

Page 39: Typical Applications

• Historical Data Analysis
– Marketing Data Mining Analysis
– Portfolio Performance Attribution
– Fraud/Surveillance Analysis

• Decision Support
– Financial Risk Management
– Operations Planning
– Military Strategic Planning

Page 40: Typical Applications (cont)

• Monitoring Real-Time Status
– Industrial Process Control
– Capital Markets Trading Management
– Network Monitoring

• Management Reporting
– Financial Reporting
– Sales and Marketing Reporting

Page 41: Marketing Data Mining Analysis

Page 42: Risk Management

Page 43: Capital Markets Trading Management

Page 44: Network Monitoring

Page 45: Industrial Process Control

Page 46: Crisis Monitoring

[Figure: Ground (Student) View and Aerial/Oracular (Instructor) View. Color code for compartment status: Normal, Ignited, Destroyed, Extinguished, Fire Alarm, Flooding, Engulfed]

Page 47: 3D Financial Reporting

Page 48: Statistics Visualizer

Page 49: Scatter Visualizer

Page 50: Splat Visualizer

Page 51: Tree Visualizer

Page 52: Map Visualizer

Page 53: Decision Tree