Course on Data Mining (581550-4)


21.11.2001 Data mining: KDD Process 1

Course timeline:

• 24./26.10. Intro/Ass. Rules
• 30.10. Clustering
• 7.11. Episodes
• 14.11. Text Mining
• 21.11. KDD Process
• 28.11. Appl./Summary
• Home Exam

Course on Data Mining (581550-4)

Today 22.11.2001

• Today's subject:
o KDD Process
• Next week's program:
o Lecture: Data mining applications, future, summary
o Exercise: KDD Process
o Seminar: KDD Process

KDD process

• Overview
• Preprocessing
• Post-processing
• Summary

Overview

What is KDD? A process!

• Aim: the selection and processing of data for
o the identification of novel, accurate, and useful patterns, and
o the modeling of real-world phenomena
• Data mining is a major component of the KDD process

Typical KDD process

[Figure: Operational database → (1) selection → input data (raw data, target data set) → (2) preprocessing (cleaned, verified, focused data) → data mining → (3) postprocessing (evaluation of interestingness) → results (selected usable patterns) → utilization]

21.11.2001 Data mining: KDD Process 6

Learning the domainLearning the domainLearning the domainLearning the domain

Data reduction and projectionData reduction and projectionData reduction and projectionData reduction and projection

Creating a target data setCreating a target data setCreating a target data setCreating a target data set

Data cleaning, integrationData cleaning, integrationand transformationand transformation

Data cleaning, integrationData cleaning, integrationand transformationand transformation

Choosing the DM taskChoosing the DM taskChoosing the DM taskChoosing the DM task

Pre-Pre-processingprocessing

Phases of the KDD process (1)Phases of the KDD process (1)

Phases of the KDD process (2): Post-processing

• Choosing the DM algorithm(s)
• Data mining: search
• Pattern evaluation and interpretation
• Knowledge presentation
• Use of discovered knowledge

Preprocessing - overview

• Why data preprocessing?
• Data cleaning
• Data integration and transformation
• Data reduction

Preprocessing

Why data preprocessing?

• Aim: to select the data relevant to the task at hand for mining
• Data in the real world is dirty
o incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
o noisy: containing errors or outliers
o inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!

Measures of data quality

o accuracy
o completeness
o consistency
o timeliness
o believability
o value added
o interpretability
o accessibility

Preprocessing tasks (1)

• Data cleaning
o fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
o integration of multiple databases, files, etc.
• Data transformation
o normalization and aggregation

Preprocessing tasks (2)

• Data reduction (including discretization)
o obtains a reduced representation in volume, but produces the same or similar analytical results
o data discretization is part of data reduction, but of particular importance, especially for numerical data

Preprocessing tasks (3)

Data Cleaning → Data Integration → Data Transformation → Data Reduction

Data cleaning tasks

• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data

Missing Data

• Data is not always available
• Missing data may be due to
o equipment malfunction
o inconsistency with other recorded data, leading to deletion
o data not entered due to misunderstanding
o certain data not being considered important at the time of entry
o history or changes of the data not being registered
• Missing data may need to be inferred

How to Handle Missing Data? (1)

• Ignore the tuple
o usually done when the class label is missing
o not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually
o tedious + infeasible?
• Use a global constant to fill in the missing value
o e.g., “unknown”, a new class?!

How to Handle Missing Data? (2)

• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same class to fill in the missing value
o smarter solution than using the “general” attribute mean
• Use the most probable value to fill in the missing value
o inference-based tools such as decision tree induction or a Bayesian formalism
o regression
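The class-conditional mean strategy above can be sketched in a few lines. A minimal Python illustration; the records and values are invented for this example, not from the lecture:

```python
from statistics import mean

# Toy records (class label, attribute value); None marks a missing value.
records = [("A", 10.0), ("A", 14.0), ("A", None),
           ("B", 30.0), ("B", None), ("B", 34.0)]

def fill_with_class_mean(records):
    """Fill each missing value with the mean of the observed values in its class."""
    by_class = {}
    for label, value in records:
        if value is not None:
            by_class.setdefault(label, []).append(value)
    class_mean = {label: mean(values) for label, values in by_class.items()}
    return [(label, class_mean[label] if value is None else value)
            for label, value in records]

print(fill_with_class_mean(records))
# the missing "A" value becomes 12.0 and the missing "B" value becomes 32.0
```

Using the per-class mean rather than the global mean keeps the filled-in values closer to the distribution of the tuple's own class.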

Noisy Data

• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
o faulty data collection instruments
o data entry problems
o data transmission problems
o technology limitations
o inconsistency in naming conventions

How to Handle Noisy Data?

• Binning
o smooth a sorted data value by looking at the values around it
• Clustering
o detect and remove outliers
• Combined computer and human inspection
o detect suspicious values automatically and have a human check them
• Regression
o smooth by fitting the data into regression functions

Binning methods (1)

• Equal-depth (frequency) partitioning
o sort the data and partition it into N bins, each containing approximately the same number of samples
o smooth by bin means, bin medians, bin boundaries, etc.
o good data scaling
o managing categorical attributes can be tricky

Binning methods (2)

• Equal-width (distance) partitioning
o divide the range into N intervals of equal size: a uniform grid
o if A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A) / N
o the most straightforward method
o outliers may dominate the presentation
o skewed data is not handled well
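Equal-width partitioning is a one-liner once W is computed. A minimal Python sketch, using the same price data as the equal-depth example:

```python
def equal_width_bins(values, n):
    """Assign each value to one of n equal-width intervals, W = (B - A) / n."""
    a, b = min(values), max(values)
    width = (b - a) / n
    # The maximum value would fall into bin n, so clamp it into the last bin.
    return [min(int((v - a) / width), n - 1) for v in values]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2]
```

Note how a single large outlier would stretch W and crowd most values into one bin, which is exactly the outlier/skew weakness listed above.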

Equal-depth binning - Example

• Sorted data for price (in dollars):
o 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equal-depth) bins:
o Bin 1: 4, 8, 9, 15
o Bin 2: 21, 21, 24, 25
o Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
o Bin 1: 9, 9, 9, 9
o Bin 2: 23, 23, 23, 23
o Bin 3: 29, 29, 29, 29
• ... by bin boundaries:
o Bin 1: 4, 4, 4, 15
o Bin 2: 21, 21, 25, 25
o Bin 3: 26, 26, 26, 34
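The example can be reproduced with a short sketch (Python; bin means are rounded to integers, as on the slide):

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def equal_depth_bins(sorted_values, n_bins):
    """Split sorted data into n_bins consecutive bins with the same number of values."""
    size = len(sorted_values) // n_bins
    return [sorted_values[i:i + size] for i in range(0, len(sorted_values), size)]

def smooth_by_bin_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

bins = equal_depth_bins(prices, 3)
print(smooth_by_bin_means(bins))  # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
```

Smoothing by bin boundaries would instead snap each value to the nearer of the bin's minimum and maximum.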

Data Integration (1)

• Data integration
o combines data from multiple sources into a coherent store
• Schema integration
o integrate metadata from different sources
o entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#

Data Integration (2)

• Detecting and resolving data value conflicts
o for the same real world entity, attribute values from different sources may differ
o possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundant Data

• Redundant data occur often when multiple databases are integrated
o the same attribute may have different names in different databases
o one attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant data may be detected by correlation analysis
• Careful integration of data from multiple sources may
o help to reduce/avoid redundancies and inconsistencies
o improve mining speed and quality
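The correlation analysis mentioned above can be sketched with a plain Pearson coefficient. A minimal Python example; the revenue figures are invented, and a derived attribute shows up as a correlation near ±1:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical integrated tables: monthly revenue vs. a derived annual revenue.
monthly = [10.0, 12.0, 9.0, 15.0, 11.0]
annual = [12 * m for m in monthly]
print(pearson(monthly, annual))  # ~1.0, so one of the attributes is redundant
```

In practice one would compute this over every attribute pair and flag the pairs whose absolute correlation exceeds some threshold.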

Data Transformation

• Smoothing: remove noise from the data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
• Normalization: scale values to fall within a small, specified range, e.g.,
o min-max normalization
o normalization by decimal scaling
• Attribute/feature construction
o new attributes constructed from the given ones

Data Reduction

• Data reduction
o obtains a reduced representation of the data set that is much smaller in volume
o produces the same (or almost the same) analytical results as the original data
• Data reduction strategies
o dimensionality reduction
o numerosity reduction
o discretization and concept hierarchy generation

Dimensionality Reduction

• Feature selection (i.e., attribute subset selection):
o select a minimum set of features such that the probability distribution of the different classes given the values for those features is as close as possible to the original distribution given the values of all features
o reduces the number of attributes appearing in the discovered patterns, making them easier to understand
• Heuristic methods (due to the exponential number of choices):
o step-wise forward selection
o step-wise backward elimination
o combining forward selection and backward elimination
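Step-wise forward selection can be sketched as a greedy loop. In the Python sketch below, `score` is a hypothetical stand-in for any subset-evaluation function (e.g. cross-validated accuracy); the toy scoring function is invented and simply rewards the attributes A1, A4, A6 while penalizing subset size:

```python
def forward_selection(features, score):
    """Greedily add the feature that improves the subset score the most; stop
    when no remaining feature improves it."""
    selected = []
    best = score(selected)
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        top_score, top_feature = max((score(selected + [f]), f) for f in candidates)
        if top_score <= best:
            break  # no candidate improves the score
        selected.append(top_feature)
        best = top_score
    return selected

# Toy score: prefers subsets containing A1, A4, A6 (mirrors the next slide's example).
toy = lambda s: len(set(s) & {"A1", "A4", "A6"}) - 0.1 * len(s)
print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy))
# selects A1, A4 and A6 (in some order)
```

Backward elimination runs the same loop in reverse, starting from the full attribute set and removing the least useful attribute at each step.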

Dimensionality Reduction - Example

Initial attribute set: {A1, A2, A3, A4, A5, A6}
> Reduced attribute set: {A1, A4, A6}

[Figure: decision tree branching first on A4, then on A1 and A6, with Class 1 / Class 2 leaves]

Numerosity Reduction

• Parametric methods
o assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
o e.g., regression analysis, log-linear models
• Non-parametric methods
o do not assume models
o e.g., histograms, clustering, sampling

Discretization

• Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
• Interval labels can then be used to replace actual data values
• Some classification algorithms only accept categorical attributes

Concept Hierarchies

• Reduce the data by collecting and replacing low-level concepts with higher-level concepts
• For example, replace numeric values for the attribute age by more general values such as young, middle-aged, or senior
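The age example above amounts to one hierarchy-climbing step. A minimal Python sketch; the cut points 30 and 60 are illustrative assumptions, not from the lecture:

```python
def age_concept(age):
    """Climb one level in a concept hierarchy: numeric age -> general concept."""
    if age < 30:             # assumed cut point
        return "young"
    if age < 60:             # assumed cut point
        return "middle-aged"
    return "senior"

print([age_concept(a) for a in [22, 45, 71]])  # ['young', 'middle-aged', 'senior']
```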

Discretization and concept hierarchy generation for numeric data

• Binning
• Histogram analysis
• Clustering analysis
• Entropy-based discretization
• Segmentation by natural partitioning

Concept hierarchy generation for categorical data

• Specification of a partial ordering of attributes explicitly at the schema level by users or experts
• Specification of a portion of a hierarchy by explicit data grouping
• Specification of a set of attributes, but not of their partial ordering
• Specification of only a partial set of attributes

Specification of a set of attributes

• A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.

country (15 distinct values)
→ province_or_state (65 distinct values)
→ city (3567 distinct values)
→ street (674 339 distinct values)
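This generation rule is easy to sketch in code (Python, using the distinct-value counts from the slide):

```python
# Distinct-value counts from the slide; fewer distinct values -> higher level.
distinct_counts = {
    "country": 15,
    "province_or_state": 65,
    "city": 3567,
    "street": 674339,
}

# Order attributes from the top of the hierarchy (fewest values) downwards.
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" -> ".join(hierarchy))  # country -> province_or_state -> city -> street
```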

Post-processing - overview

• Why data post-processing?
• Interestingness
• Visualization
• Utilization

Post-processing

Why data post-processing? (1)

• Aim: to show the results, or more precisely the most interesting findings, of the data mining phase to the user in an understandable way
• A possible post-processing methodology:
o find all potentially interesting patterns according to some rather loose criteria
o provide flexible methods for iteratively and interactively creating different views of the discovered patterns
• Other more restrictive or focused methodologies are possible as well

Why data post-processing? (2)

• A post-processing methodology is useful, if
o the desired focus is not known in advance (the search process cannot be optimized to look only for the interesting patterns)
o there is an algorithm that can produce all patterns from a class of potentially interesting patterns (the result is complete)
o the time requirement for discovering all potentially interesting patterns is not considerably longer than if the discovery was focused on a small subset of potentially interesting patterns

Are all the discovered patterns interesting?

• A data mining system/query may generate thousands of patterns, but are they all interesting? Usually NOT!
• How could we then choose the interesting patterns? => Interestingness

Interestingness criteria (1)

• Some possible criteria for interestingness:
o evidence: statistical significance of the finding?
o redundancy: similarity between findings?
o usefulness: meeting the user's needs/goals?
o novelty: already part of prior knowledge?
o simplicity: syntactical complexity?
o generality: how many examples are covered?

Interestingness criteria (2)

• One division of interestingness criteria:
o objective measures that are based on statistics and structures of patterns, e.g.,
 - J-measure: statistical significance
 - certainty factor: support or frequency
 - strength: confidence
o subjective measures that are based on the user's beliefs in the data, e.g.,
 - unexpectedness: "is the found pattern surprising?"
 - actionability: "can I do something with it?"

Criticism: Support & Confidence

• Example (Aggarwal & Yu, PODS'98):
o among 5000 students: 3000 play basketball, 3750 eat cereal, and 2000 both play basketball and eat cereal
o the rule "play basketball → eat cereal [40%, 66.7%]" is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%
o the rule "play basketball → not eat cereal [20%, 33.3%]" is far more accurate, although it has lower support and confidence
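The slide's figures are straightforward to recompute (a quick Python check):

```python
# Counts from the Aggarwal & Yu example.
students, basketball, cereal, both = 5000, 3000, 3750, 2000

support = both / students          # P(basketball and cereal) = 0.40
confidence = both / basketball     # P(cereal | basketball)  ~ 0.667
base_rate = cereal / students      # P(cereal)               = 0.75

print(support, confidence, base_rate)
# confidence is below the base rate, so the rule is misleading
```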

Interest

• Yet another objective measure for interestingness is interest, defined as

 interest(A → B) = P(A ∧ B) / (P(A) · P(B))

• Properties of this measure:
o takes both P(A) and P(B) into consideration
o P(A ∧ B) = P(A) · P(B), if A and B are independent events
o A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated
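Applied to the basketball/cereal example from the criticism slide (a quick Python check):

```python
# Interest of the rule basketball -> cereal: P(A^B) / (P(A) * P(B)).
p_a = 3000 / 5000    # P(plays basketball)
p_b = 3750 / 5000    # P(eats cereal)
p_ab = 2000 / 5000   # P(both)

interest = p_ab / (p_a * p_b)
print(interest)  # ~0.89 < 1: basketball and cereal are negatively correlated
```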

J-measure

• The J-measure is another objective measure for interestingness:

 J-measure(A → B) = conf(A) · [ conf(A → B) · log( conf(A → B) / conf(B) ) + (1 − conf(A → B)) · log( (1 − conf(A → B)) / (1 − conf(B)) ) ]

• Properties of the J-measure:
o again, takes both P(A) and P(B) into consideration
o the value is always between 0 and 1
o can be computed using pre-calculated values
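A sketch of the computation (Python; base-2 logarithms are an assumption, since the slide does not fix the log base):

```python
from math import log2

def j_measure(p_a, conf_ab, p_b):
    """J-measure of a rule A -> B: p_a = conf(A), conf_ab = conf(A -> B), p_b = conf(B)."""
    def term(p, q):
        # Contribution p * log2(p / q), with the usual 0 * log 0 = 0 convention.
        return 0.0 if p == 0 else p * log2(p / q)
    return p_a * (term(conf_ab, p_b) + term(1 - conf_ab, 1 - p_b))

# Basketball -> cereal rule from the earlier slide: P(A) = 0.6, conf = 2/3, P(B) = 0.75.
print(j_measure(0.6, 2 / 3, 0.75))  # a small positive value, within [0, 1]
```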

Support/Frequency/J-measure

[Figure: number of rules vs. support/frequency/J-measure threshold (0 to 1) for six datasets]

Confidence

[Figure: number of rules vs. confidence threshold (0 to 1) for six datasets]

Example – Selection of Interesting Association Rules

• For reducing the number of association rules that have to be considered, we could, for example, use one of the following selection criteria:
o frequency and confidence
o J-measure or interest
o maximum rule size (whole rule, left-hand side, right-hand side)
o rule attributes (e.g., templates)

Example – Problems with selection of rules

• A rule can correspond to prior knowledge or expectations
o how to encode the background knowledge into the system?
• A rule can refer to uninteresting attributes or attribute combinations
o could this be avoided by enhancing the preprocessing phase?
• Rules can be redundant
o redundancy elimination by rule covers etc.

Interpretation and evaluation of the results of data mining

• Evaluation
o statistical validation and significance testing
o qualitative review by experts in the field
o pilot surveys to evaluate model accuracy
• Interpretation
o tree and rule models can be read directly
o clustering results can be graphed and tabled
o code can be automatically generated by some systems

Visualization of Discovered Patterns (1)

• In some cases, visualization of the results of data mining (rules, clusters, networks, ...) can be very helpful
• Visualization is actually already important in the preprocessing phase, in selecting the appropriate data or in looking at the data
• Visualization requires training and practice

Visualization of Discovered Patterns (2)

• Different backgrounds/usages may require different forms of representation
o e.g., rules, tables, cross-tabulations, or pie/bar charts
• Concept hierarchy is also important
o discovered knowledge might be more understandable when represented at a high level of abstraction
o interactive drill up/down, pivoting, slicing and dicing provide different perspectives on the data
• Different kinds of knowledge require different kinds of representation
o association, classification, clustering, etc.

Visualization

Utilization of the results

[Figure: pyramid of increasing potential to support business decisions, from bottom to top: Data Sources (paper, files, information providers, database systems, OLTP); Data Warehouses / Data Marts; Data Exploration (statistical analysis, querying and reporting; OLAP, MDA); Data Mining (information discovery); Data Presentation (visualization techniques); Making Decisions. Typical users from bottom to top: DBA, Data Analyst, Business Analyst, End User]

Summary

• Data mining: semi-automatic discovery of interesting patterns from large data sets
• Knowledge discovery is a process:
o preprocessing
o data mining
o post-processing
o using and utilizing the knowledge

Summary

• Preprocessing is important in order to get useful results!
• If a loosely defined mining methodology is used, post-processing is needed in order to find the interesting results!
• Visualization is useful in pre- and post-processing!
• One has to be able to utilize the found knowledge!

References – KDD Process

• P. Adriaans and D. Zantinge. Data Mining. Addison-Wesley: Harlow, England, 1996.
• R.J. Brachman and T. Anand. The process of knowledge discovery in databases. In Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
• D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Communications of the ACM, 42:73-78, 1999.
• M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 8:866-883, 1996.
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
• T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39:58-64, 1996.
• Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), December 1997.
• D. Keim. Visual techniques for exploring databases. Tutorial notes, KDD'97, Newport Beach, CA, USA, 1997.
• D. Keim. Visual data mining. Tutorial notes, VLDB'97, Athens, Greece, 1997.
• D. Keim and H.-P. Kriegel. Visual techniques for mining large databases: a comparison. IEEE Transactions on Knowledge and Data Engineering, 8(6), 1996.

References – KDD Process

• W. Kloesgen. Explora: A multipattern and multistrategy discovery assistant. In U.M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, 249-271. AAAI/MIT Press, 1996.
• M. Klemettinen. A knowledge discovery methodology for telecommunication network alarm databases. Ph.D. thesis, University of Helsinki, Report A-1999-1, 1999.
• M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, Gaithersburg, Maryland, Nov. 1994.
• G. Piatetsky-Shapiro, U. Fayyad, and P. Smyth. From data mining to knowledge discovery: An overview. In U.M. Fayyad et al. (eds.), Advances in Knowledge Discovery and Data Mining, 1-35. AAAI/MIT Press, 1996.
• G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
• D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
• T. Redman. Data Quality: Management and Technology. Bantam Books, New York, 1992.
• A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE Trans. on Knowledge and Data Engineering, 8:970-974, Dec. 1996.
• D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, Seattle, Washington, June 1998.

References – KDD Process

• Y. Wand and R. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39:86-95, 1996.
• R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995.

Reminder: Course Organization

Course Evaluation

• Passing the course: min 30 points
o home exam: min 13 points (max 30 points)
o exercises/experiments: min 8 points (max 20 points)
 - at least 3 returned and reported experiments
o group presentation: min 4 points (max 10 points)
• Remember also the other requirements:
o attending the lectures (5/7)
o attending the seminars (4/5)
o attending the exercises (4/5)

Seminar Presentations/Groups 9-10

Visualization and data mining

D. Keim, H.-P. Kriegel, T. Seidl: “Supporting Data Mining of Large Databases by Visual Feedback Queries", ICDE'94.

Seminar Presentations/Groups 9-10

Interestingness

G. Piatetsky-Shapiro, C.J. Matheus: “The Interestingness of Deviations”, KDD'94.

Thanks to Jiawei Han from Simon Fraser University and Mika Klemettinen from Nokia Research Center for their slides, which greatly helped in preparing this lecture!

Also thanks to Fosca Giannotti and Dino Pedreschi from Pisa for their slides.

KDD process